ABSTRACT


These notes provide an introduction to the class of single-instruction,
multiple-data stream computers with the simplest processing elements. Design
principles are explained in terms of hypothetical Distributed Processor Arrays,
with examples drawn from experimental systems. Emphasis is placed on (a)
minimising the cost differential when the DPA is compared with conventional
main storage, and (b) designing the array control unit to support advanced
forms of protection and language implementation. The influence of the DPA
on general system design is examined briefly.


The work described herein was supported in part by the Joint Services
Electronics Program under Contract No. N00014-7 5-0601.
1 DISTRIBUTED PROCESSOR ARRAYS


These lectures are concerned with assemblies of processors, each
having local arithmetic and local storage capability but sharing a common
control unit, so that they execute synchronously a broadcast stream of
instructions. When good (in terms of performance) such arrays can be
expected to be very, very good, but when they are bad they are unproductive
if not exactly horrid. Thus one of the main objects of system design is
to keep the array working on problems at the favourable end of the spectrum
for a sufficient proportion of time to justify its cost. In the Illiac IV
machine an attempt has been made to achieve that end by covering an
extensive user catchment area with a communications network, enabling remote
sites to send problems to the array: there are very few laboratories or
businesses that present a continuous workload with the high degree of
parallelism necessary for efficient working. An alternative approach is
to reduce the cost attributable to the array to the extent that it can
operate economically at a low duty cycle. In principle, one would like
to see a 1% improvement in system throughput for an investment of
substantially less than 1%, but in practice we shall see that the presence of
an array can affect system design in ways which are impossible to quantify.

Readers unacquainted with array processor designs will gain some
insight from early papers on Solomon [1] and Illiac IV [2]. One of the
main performance bottlenecks on conventional systems is the data path from
processor (or processors) to program storage units: with random access to
words the fastest crossbar or bus systems achieve peak rates of about $10^8$
bytes/second. There are two methods of exceeding this limit, both of
which assume non-random access patterns:

(a)  Use vector addressing modes, which allow retrieval and storage
     of word sequences regularly spaced in store. It is then necessary
     to transmit only a fraction of the addresses, and speed
     gains can be achieved by interleaving store cycles. The CDC Star,
     TI ASC and Cray-1 machines provide examples of this approach, which
     leads to maximum data rates in the region of $10^9$ bytes/second.

(b)  Use a new pattern of physical interconnection in which each
     processor has direct access to only a limited region of storage. Thus
     Illiac IV with 64 local stores containing 8 byte words and cycling
     at 240nsec would achieve a maximum data rate of about $2 \times 10^9$
     bytes/second. It will be seen later that practical data processing
     rates exceeding 10^^ bytes/second can be envisaged with currently
     available technology, at the expense of severe limitations on
     accessibility.
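
The Illiac IV figure quoted in (b) follows directly from the store count, word size and cycle time. The following is a minimal sketch of that arithmetic; the function name and the assumption that every local store delivers one word per cycle are illustrative and not taken from the notes.

```python
def peak_rate_bytes_per_sec(stores: int, word_bytes: int, cycle_ns: float) -> float:
    """Peak bytes/second, assuming every local store delivers one word per cycle."""
    return stores * word_bytes / (cycle_ns * 1e-9)


# Figures quoted in the text for Illiac IV: 64 stores, 8-byte words, 240 nsec cycle.
illiac_rate = peak_rate_bytes_per_sec(stores=64, word_bytes=8, cycle_ns=240)
print(f"Illiac IV peak rate: {illiac_rate:.2e} bytes/second")
```

The result is a little over $2 \times 10^9$ bytes/second, in agreement with the rounded value given above.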

Of course, maximum data rate is not the end of the story, and although
it is true that data access patterns are non-random they vary appreciably
from one class of problem to another, allowing various degrading factors
to come into play. We should note in passing that the class of problems
of immediate interest are characterised by regular data spacing, which is
not well served by slave memory techniques. On the other hand slave stores
deal with the type of non-randomness which appears as repeated reference to
the same locality, so the two mechanisms are not competing for the same
class of problem and we may hope to see them working effectively in
combination in some future system.



For practical purposes only linear or rectangular arrays need be
considered. Handling three dimensional arrays is severely limited by the
planar form of hardware, which is unlikely to change until radically new
methods of manufacture are proven. The form of interconnection within a
plane is more open to debate. Many problems are naturally expressed in
polar form, or by using a hexagonal cell pattern rather than square; the
result of mapping them into a rectangular array is to leave some of the
processors and connection paths unused. However, in the light of
experience gained so far the square array with four near-neighbour connections
appears to be most widely applicable.

The main application incentive derives from the numerical solution
of field equations such as those occurring in reactor physics and
meteorology, in which method (b) is not unduly restrictive. Moreover, most
numerical methods are based on a discrete representation of physical space
and time that can be mapped directly onto the array, the local storage
providing the third and fourth dimensions when necessary. Problems of
this class create a practically unlimited demand for computing power in
the quest for high resolution and there is no doubt that very high absolute
performance and high performance/cost are attainable using array techniques.
The principles of such applications are outlined below, but the emphasis of
discussion is on non-numerical problems, system and language design.

The main engineering stimulus comes from the emergence of
semiconductor stores as main memory components: having the same physical and
electrical properties as logical devices, it is far easier to consider
closely-coupled assemblies than in the days of core memory. Possible
applications of this principle to cache memories have been pointed out
by Stone [3]. A related factor of extreme importance is that simple and
highly repetitive circuits are very suitable for LSI manufacture: some
of the processors to be considered require less than 100 logic gates each
and could eventually represent a negligible cost increase. Simplicity is
the consequence of using single-bit wide data paths and providing a primitive
instruction set. The complex functions that are needed for arithmetic and
data manipulation reside in the form of stored program in the same way as
microcode for a conventional machine. It is possible that reduced hardware
costs will allow us to regard a processor of 1000 gates as 'negligible' at
some future date, at which time commitment to greater functionality or
wider data paths can be considered as an alternative to closer packing of
single bit processors. The main requirement at present, however, is to
understand the trade-offs well enough to make sensible decisions, and the
study of single-bit processors seems to be the best starting point for that.

The reader may recognise the resemblance to cellular arrays [4],
[5], whose study is prompted by the same technological projections. The
main difference is that the logic and data paths are thought of as fixed
in the processor array and variable (often on a row or column basis) in
many cellular designs. There is no intrinsic reason for maintaining the
distinction. Further research may show effective ways of combining the
two lines of development.


1.1  DPA storage
A store module can be thought of as a rectangular array of
(semiconductor) storage elements whose dimensions are determined as a
multiple of the store data word size and a power of two. For illustration
a 'toy' store of 16-bit words in 16 rows will be used, and I shall
denote by 'DPAn', where n is in the range 1 to 9, a square array with sides
of length $2^n$. Thus the toy array is referred to as DPA4. The practically
useful arrays in terms of computing power require n at least 6. Parity
and/or tag bits are assumed to be present above the nominal word size, but
they do not take part in the array processing activities and they will be
omitted from the following description. Figure 1 shows a plan view of the
store module and a '3D' view with the storage bits extended in the vertical
direction to show the layout of words in horizontal sections through each
row of storage elements. Each storage element contains a few thousand
binary digits: I shall refer to an array with m kbit stores as 'DPAn.m'.
Thus the toy array, which has 1024 bits in each element, is DPA4.1.


Figure 1: Storage module layout for DPA4.1 (plan view with corner elements
(0,0), (0,15), (15,0) and (15,15), and '3D' view showing words in bit planes)


The array processor attaches a small computer or processing
element (PE) to each storage element. It is in that sense that processing
power is 'distributed' through the store. The control logic normally
associated with the store module for handling addresses and data is
elaborated to form instructions that are broadcast in synchronism to every PE
in the array: each PE must obey the instruction; the only option that can
be exercised locally is whether to store the result. Figure 2 shows a plan
of the PE array and a perspective view with the processors in the lid of
the store. In programming terms we shall see that data is processed by
passing it 'vertically' through the lid of the store $2^{2n}$ bits at a time
from a selected bit plane.
Figure 2: Storage module with processing elements for DPA4.1


Each PE has single digit data connection to neighbours in four
directions designated N, E, S, and W. Elements at the edges of the array
are always short of one or two neighbours. The input at these points can
be selected by program to be (a) always zero; (b) taken from the other end
of the same row or column to give cylindrical or toroidal geometry; (c)
taken from the other end of the next row (or column) to form a linear or
circular array of $2^{2n}$ PEs. In each geometry the thickness of the surface
or line is determined by the local store size, eg 1024 bits in the toy
machine.
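
The three edge treatments can be sketched for a single move of a bit plane one PE to the east, in which every PE takes the value held by its west neighbour and the PEs on the west edge need a substitute input. This is an illustrative model only: the function name, the mode names and the choice of the adjacent row as the source in the linear and circular cases are assumptions, not part of the design described in these notes.

```python
def move_east(plane, edge="zero"):
    """Return a new plane in which each PE holds its west neighbour's value.

    edge selects the input used by PEs on the west edge (column 0):
      "zero"     - always zero                          (case a)
      "wrap"     - other end of the same row            (case b)
      "linear"   - other end of the adjacent row, zero for the first PE (case c)
      "circular" - as "linear", but the first PE takes the very last PE (case c)
    """
    n = len(plane)
    out = []
    for i in range(n):
        row = []
        for j in range(n):
            if j > 0:
                row.append(plane[i][j - 1])
            elif edge == "zero":
                row.append(0)
            elif edge == "wrap":
                row.append(plane[i][n - 1])
            elif edge == "linear":
                row.append(plane[i - 1][n - 1] if i > 0 else 0)
            elif edge == "circular":
                row.append(plane[i - 1][n - 1])  # index -1 wraps to the last row
            else:
                raise ValueError(f"unknown edge mode: {edge}")
        out.append(row)
    return out


# A 2 x 2 plane with labelled cells makes the data movement easy to follow.
demo = [[1, 2], [3, 4]]
for mode in ("zero", "wrap", "linear", "circular"):
    print(mode, move_east(demo, mode))
```

In the linear and circular modes the plane behaves as one line of PEs read row by row, which is the sense in which the array has $2^{2n}$ elements in a single dimension.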
Example

The choice of PE connections is a compromise between pin limitations
inherent in the form of construction and problem requirements that
have become apparent in application studies. The consequence of
omitting a connection is that data has to be routed 1 or $2^n$ cells
at a time across the array for as many bits as there are in the data.
The result can best be illustrated by considering a problem that is
not square to start with, namely the mapping of a global coordinate
system with fewer grid points at the poles than the equator. Figure
3a shows that direct mapping onto a 24*16 array (Fig. 3b) would leave
a substantial number of PEs unused. One possible treatment is to wrap
the map round a cylinder so that the unused PEs near the N pole mesh
with used PEs from the S pole. However, it can be seen that
neighbouring points near the poles on the global map would still be
somewhat distant in the array, and the situation can be improved if the
N and S hemispheres are first displaced W and E respectively and then
wrapped round the cylinder (Fig. 3c). With the DPA4 array the PE
utilisation is nearer 90% and could be increased on a larger grid.

It will be noted, however, that high utilisation has been achieved
by matching the array size to the problem, which is not always
possible. Also note that neighbours on the global map are not
always neighbours in the array, time being lost in routing numerical
values across up to four columns.

We can readily identify three factors which prevent the DPA from
achieving the theoretical data processing rate of $2^{2n}$ bits per instruction:

(a)  mapping, which prevents a problem from being cast into a form to
     use all PEs;

(b)  routing, which occupies the array in unproductive data movements;

(c)  branching, which causes only a fraction of the PEs to be active
     during a particular phase of calculation, eg in the preceding
     example the poles may require special calculations which could
     use only a fraction of the available PEs.

Experience shows that intuitive judgements based on habits formed in using
conventional machines can be quite far from the truth, particularly in the
area of (a) and (b).
1.2  Array operations

The Array Control Unit (Figure 2) is responsible for broadcasting
instructions to the processing elements, providing registers to buffer the
data sent and received on the common row and column lines, and for serving
external requests for access to the memory. The type of external request
depends on the part played by the DPA in the system context, which I shall
return to in the later lectures. In this subsection and the next we consider
the elementary functions of the array and the programs executed by the ACU.

It is assumed that a DPAn is controlled by an ACU with internal data width
$2^n$, eg the DPA4 is associated with 16 bit data fields in the ACU, which are
placed in correspondence with the row and column data lines: otherwise the
description would have to provide operations to align part-words with the
array edges, or conversely.


Let Y be a register in the ACU with bits $Y_i$, and let $A_{i,j}$ be a
bit in the (i,j)th PE. Then four input functions are defined as follows:

(1)  Input by row:
     $A_{i,j} = Y_i$ for all (i,j)

(2)  Input by column:
     $A_{i,j} = Y_j$ for all (i,j)

(3)  Input by row with column select:
     $A_{i,J} = Y_i$ for all i; $A_{i,j}$ unchanged if $j \neq J$

(4)  Input by column with row select:
     $A_{I,j} = Y_j$ for all j; $A_{i,j}$ unchanged if $i \neq I$

Thus in cases (1) and (2) a bit of Y is broadcast to all PEs along a row or
column of the array. Case (4) corresponds to store write when the selected
bit $A_{i,j}$ is in the local memory of each PE.

Corresponding to input functions there is a set of four output
functions, with and without selection by row or column. Here the resultant
bit in the Y register is the logical and of all selected PEs on the data line:

(1)'  Output by row:
      $Y_i = \bigwedge_j A_{i,j}$

(2)'  Output by column:
      $Y_j = \bigwedge_i A_{i,j}$

(3)'  Output by row with column select:
      $Y_i = A_{i,J}$

(4)'  Output by column with row select:
      $Y_j = A_{I,j}$

The output operation 'by column with row select' corresponds to normal store
read when the $A_{i,j}$ is in local memory. The method of choosing the A and Y
words is discussed later.
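
The input and output functions can be modelled on a small plane held as a list of rows. The sketch below is illustrative only: the function names are invented here, the plane is an ordinary Python list rather than PE storage, and the output functions follow the statement above that the result bit is the logical and of the selected PEs on each data line.

```python
def input_by_row(A, Y):
    """(1): every PE in row i receives bit Y[i]."""
    for i, row in enumerate(A):
        for j in range(len(row)):
            row[j] = Y[i]


def input_by_column(A, Y):
    """(2): every PE in column j receives bit Y[j]."""
    for row in A:
        for j in range(len(row)):
            row[j] = Y[j]


def input_by_row_column_select(A, Y, J):
    """(3): only column J is written; other columns are unchanged."""
    for i, row in enumerate(A):
        row[J] = Y[i]


def input_by_column_row_select(A, Y, I):
    """(4): only row I is written, like a conventional store write of one word."""
    A[I][:] = Y


def and_output_by_row(A):
    """Bit i of the result is the logical and of all PEs in row i."""
    return [int(all(row)) for row in A]


def output_by_column_row_select(A, I):
    """Only row I is selected, so the result is that row: a store read."""
    return list(A[I])


plane = [[0] * 4 for _ in range(4)]
input_by_column_row_select(plane, [1, 0, 1, 1], I=2)   # write a word into row 2
print(output_by_column_row_select(plane, I=2))          # read it back
```

Writing a word with (4) and reading it back with the corresponding output function returns the same word, which is the sense in which these two operations behave as an ordinary store.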

In its simplest form the PE contains three single-bit registers
with the following uses:

A  is the activity register. When zero, writing to local stores
   can be inhibited;

B  is the arithmetic and logical accumulator;

C  is the carry digit.

The other components of the PE are a routing multiplexor, which is used to
select input to local store from the A or B registers, near neighbours (N,
E, S, W), or the common row (R) or column (C) data lines, and the local memory
itself. An inverter (I) allows the polarity of data to be reversed in going
from memory into the A or B registers, and the or gate SEL allows program
control of the use of A to inhibit 'store' operations.


Figure 4: Processing element schematic for the DPA
The PE carries out two types of operation: functions of the
arithmetic registers A, B and C; and routing of data. Each uses the broadcast
local memory address, which is the same for all PEs, so that the operation
is carried out on a 'bit plane' in the DPA. An address is specified by an
ACU register which for security reasons also contains a limit field giving
the maximum modifiable range of that address. To address the DPA4 we
take a 16 bit byte location, a 12 bit limit and a 4 bit tag field. The most
significant 11 bits of the location select a bit plane, the remainder specify
column or row select when necessary (4 bits) and the least significant bit
selects even or odd byte.


ACU address:      | TAG | LIMIT | LOCATION |
                  where LOCATION = | BIT PLANE | ROW/COL | BYTE |

ACU data:         | TAG | VALUE |

ACU control:      | TAG | MODE | LOCATION |

ACU instruction:  | f | X | Y |  or  | f | X | N |
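
The division of the 16 bit byte location into bit plane, row or column select and byte select can be sketched as follows. The function name and the returned tuple are illustrative; only the field widths (11, 4 and 1 bits, most significant first) come from the text.

```python
def decode_location(location: int):
    """Split a 16-bit DPA4 byte location into (bit_plane, row_or_col, byte)."""
    if not 0 <= location < 1 << 16:
        raise ValueError("location must fit in 16 bits")
    byte = location & 0x1               # least significant bit: even or odd byte
    row_or_col = (location >> 1) & 0xF  # next 4 bits: row or column select
    bit_plane = location >> 5           # most significant 11 bits: bit plane
    return bit_plane, row_or_col, byte


# Location 32 is the first byte of bit plane 1: each plane of 16 x 16 bits
# occupies 32 bytes.
print(decode_location(32))
```

One consequence of this layout is that stepping an address by 32 moves it to the same position in the next bit plane.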

The ACU data is tagged to distinguish it from addresses: the result of an
array output operation is tagged as 'data', and similarly the operand of
an input operation must be 'data'. An ACU program control pointer has a
similar form to an address, without the LIMIT field but including MODE bits
which specify the geometric connections as well as conventional arithmetic
and control modes.


The ACU instruction is uniformly 16 bits in width, giving
a function field f and either two 4-bit register addresses X, Y or a
single register and a literal field N. The instruction set covers the
requirements of sequential operations in the ACU itself and parallel
operations in the DPA of the two types mentioned above.


DPA ARITHMETIC

The X-register address is used to select a local data value x in
each PE. The following functions are available:

LDA        Load A        Sets A = x

LDB, LDC   Load B, C     Set B = x, C = x respectively

ADD        Add to acc.   Forms the sum of B, x and C in B
                         and forms the carry in C

AND        AND to acc.   Forms the logical and of B and x in B

OR         OR to acc.    Forms the logical or of B and x in B

EQU        EQU to acc.   Forms the logical equivalence of B
                         and x in B

In any arithmetic function the datum can optionally be inverted.
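
The effect of these functions on the registers of one PE can be modelled in a few lines. This is a sketch under stated assumptions: the class and method names are invented for illustration, and the optional inversion is modelled as a boolean flag on each call rather than as a field of the instruction.

```python
class PE:
    """Single-bit processing element with registers A, B and C."""

    def __init__(self):
        self.A = 0  # activity
        self.B = 0  # accumulator
        self.C = 0  # carry

    @staticmethod
    def _datum(x, invert):
        return x ^ 1 if invert else x

    def lda(self, x, invert=False):
        self.A = self._datum(x, invert)

    def ldb(self, x, invert=False):
        self.B = self._datum(x, invert)

    def ldc(self, x, invert=False):
        self.C = self._datum(x, invert)

    def add(self, x, invert=False):
        # One-bit full adder: sum goes to B, carry out goes to C.
        total = self.B + self._datum(x, invert) + self.C
        self.B, self.C = total & 1, total >> 1

    def and_(self, x, invert=False):
        self.B &= self._datum(x, invert)

    def or_(self, x, invert=False):
        self.B |= self._datum(x, invert)

    def equ(self, x, invert=False):
        # Logical equivalence: 1 when B and the datum agree.
        self.B = 1 ^ (self.B ^ self._datum(x, invert))


pe = PE()
pe.ldb(1)
pe.ldc(1)
pe.add(1)            # 1 + 1 + 1 = binary 11
print(pe.B, pe.C)
```

Multi-bit arithmetic is then a matter of repeating ADD over successive bit planes, with C carrying from one step to the next.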


DPA ROUTING

The X-register address is used to select a destination plane
or in some cases the source. The following functions are available:

IR     Input by row                       Uses Y to provide data input to x
                                          according to input function (1)

IC     Input by column                    See (2)

IRC    Input by row with column select    See (3)

ICR    Input by column with row select    See (4)

AOR    AND output by row                  See (1)'

AOC    AND output by column               See (2)'

ORC    Output by row with column select   See (3)'

OCR    Output by column with row select   See (4)'

Note that ICR corresponds to conventional STORE and OCR to LOAD functions.

MVN             Move North    The datum plane is moved north one PE

MVS, MVE, MVW                 Similarly for south, east and west.

Note that MVE, MVW correspond to single bit word shifts in the array, the
'most significant' or 'left' end of a word assumed to be on the W edge.

STA    Store A    Sets x = A

STB    Store B    Sets x = B

In any routing function the store action is by default conditional on the
value of A = 1 in the destination PE; it is possible to override A in any
instruction by following the function with "/U", eg:

STB/U    Store B unconditional, ie independent of the value of A
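
The conditional store can be pictured as a masked write over a whole bit plane. In this sketch the function name and the representation of planes as lists of rows are assumptions made for illustration; the flag `unconditional` stands for the "/U" suffix.

```python
def store_plane(dest, source, activity, unconditional=False):
    """Return dest after a store: a PE takes the source bit only if its
    activity bit A is 1, unless the store is unconditional (the /U form)."""
    return [
        [s if (unconditional or a == 1) else d
         for d, s, a in zip(d_row, s_row, a_row)]
        for d_row, s_row, a_row in zip(dest, source, activity)
    ]


x = [[0, 0], [0, 0]]          # destination bit plane
B = [[1, 1], [1, 1]]          # accumulator values in each PE
A = [[1, 0], [0, 1]]          # activity register in each PE

print(store_plane(x, B, A))                      # STB
print(store_plane(x, B, A, unconditional=True))  # STB/U
```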
1.3  The array control unit


The ACU obeys instructions fetched from the DPA. Its operations
include those listed in the previous subsection together with conventional
control, arithmetic, logical and addressing functions. The ACU is classed
as a Pointer-Number machine, ie one in which a distinction is drawn between
pointers (control and data addresses, codewords or capabilities) and
numbers. Functions are provided to create and manipulate various classes
of objects, the instruction set being designed to prevent abuse in the
sense of damaging integrity or gaining access to objects without permission.
However, only one aspect of the overall design need concern us here and
that is the use of addresses to refer to bytes, words or bit planes in the
DPA.

The form of data address is given in subsection 1.2. The 12-bit limit
field enables an address to refer to up to 128 consecutive bit planes. The
low order 5 location bits are used in byte and word access, including row
and column selection. To simplify the addressing rules when working with
array data we declare that protection of binary segments is only resolved
to the bit plane boundaries.


An ACU program will be written as a sequence of statements with
elementary IF, GOTO, WHILE and DO control clauses. Functions of the ALU
are expressed by the following operators:


ARITHMETIC AND LOGIC

syntax          symbol   function            operand types   example

binary infix    +        add                 N, N            x + y
binary infix    -        subtract            N, N            x - y
unary prefix    -        negate              N               -a
binary infix    *        integer mpy         N, N            a * p
binary infix    &        logical and         N, N            x & #FF
binary infix    /        logical or          N, N            x / x
binary infix    %        logical neq         N, N            a % b
binary infix    <<       left shift          N, N            p << 3
binary infix    >>       right shift         N, N            e >> b

ADDRESSING

binary infix    '        modification        A, N            a ' 8
binary infix    t        limitation          A, N            b + n
unary postfix   .        load                A               x.

ASSIGNMENT

binary infix    =        register transfer                   x = y
binary infix             store               A, -            x y

In each case the operands are ACU registers that will be declared as
required, or literals where numeric arguments are allowed. Expressions
are evaluated from left to right, addressing operations taking precedence
over arithmetic and assignment. Where no assignment operator is present
and the first operand is a register the result overwrites the register, as
in the statement "x'1", which modifies the register x by 1.

The usual conditions are set by arithmetic and logical operations
and tested in control clauses (NZ, ZE, GT, GE, LT, LE, OV, NV). The
addressing functions produce invalid (null) results in the event of
protection violation and set the condition IR with inverse VR (valid result).
Arithmetic and control functions fail if an operand of the incorrect type
is presented. In most of the examples given below such exceptions are
assumed not to occur. The protection rules simply ensure that a program
does not cause damage outside the protection domain defined by the ACU
registers when a programming error occurs.

DPA functions will be expressed using as prefix or infix
operators the mnemonics given in the previous subsection. Arithmetic
functions require one argument (an expression giving the address of a bit
plane), which will be preceded by '-' when inverting the input to the PE.
Store functions require one argument giving the address of a bit plane.
Array input requires a destination (bit plane address) and source (data
register). Output requires a destination register and source. Finally,
move operations require a bit plane address and step count (data).

The example below illustrates the conventions
used in writing ACU programs. It is assumed that the ACU supports a
procedure calling mechanism so that the function in the example would be
called as "SUBTRACT(P, Q)", where P and Q specify operands. The program
will abort if Q is longer than P and set OV if overflow occurs anywhere
in the array, the common plane OFLOW indicating which elements overflowed.
The mechanism of procedure call and module interconnection will be examined
in a later lecture because it is affected by the presence of the DPA.


/* Example:

   The following program segment subtracts two arrays of (P+1)
   bit 2's complement integers stored in vertical form. The
   result ARG1 - ARG2 is stored in RES in all active PEs. The
   ACU OV condition is set if any result overflows. The boolean
   matrix OFLOW is set = 0 wherever overflow has occurred in an
   active PE. The arguments are specified by bit plane addresses
   with least significant digits in the high address plane */

REGISTERS [ ARG1 ARG2 RES P OFLOW ]

/* Set carry in and initialise overflow plane */

OFLOW IR/U -1 ; LDC OFLOW; P<<5

/* Subtraction loop */

WHILE GE DO (LDB ARG1'P; ADD -ARG2'P; STB RES'P; P-32)

/* Set overflow plane to zero wherever overflow has occurred */

LDB ARG1; ADD -ARG2; EQU RES; STB OFLOW;
P AOR OFLOW; IF (P ≠ -1) (SETOV); RETURN
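
The bit-serial subtraction can be checked with a small simulation in which each integer is spread over successive bit planes, one PE per integer. This is a sketch, not a transcription of the ACU program: the helper names are invented, planes are plain Python lists with index 0 holding the sign digits, and the activity register is ignored so that every PE is treated as active.

```python
def to_vertical(values, bits):
    """Spread two's complement integers over bit planes; planes[0] holds signs."""
    return [[(v >> (bits - 1 - p)) & 1 for v in values] for p in range(bits)]


def from_vertical(planes):
    """Recover signed integers from bit planes (inverse of to_vertical)."""
    bits = len(planes)
    result = []
    for k in range(len(planes[0])):
        raw = 0
        for p in range(bits):
            raw = (raw << 1) | planes[p][k]
        result.append(raw - (1 << bits) if planes[0][k] else raw)
    return result


def subtract_vertical(arg1, arg2):
    """Return (res, oflow): res = arg1 - arg2 per PE, oflow = 0 where overflow."""
    bits, pes = len(arg1), len(arg1[0])
    carry = [1] * pes                      # carry in of 1 completes the negation
    res = [[0] * pes for _ in range(bits)]
    for p in range(bits - 1, -1, -1):      # least significant plane first
        for k in range(pes):
            total = arg1[p][k] + (1 - arg2[p][k]) + carry[k]
            res[p][k] = total & 1
            carry[k] = total >> 1
    oflow = []
    for k in range(pes):
        # Repeat the sign-digit addition using the carry OUT of the sign
        # position; the result differs from RES exactly when overflow occurred.
        b = (arg1[0][k] + (1 - arg2[0][k]) + carry[k]) & 1
        oflow.append(int(b == res[0][k]))
    return res, oflow


a = to_vertical([3, -8, 7, -1], bits=4)
b = to_vertical([1, 1, -1, -1], bits=4)
res, oflow = subtract_vertical(a, b)
print(from_vertical(res), oflow)
```

The final test works because the carry into and the carry out of the sign position are equal exactly when the result is representable.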


2 ELEMENTARY DPA PROCEDURES


Having shown in principle how a DPA is controlled we can examine
its application to some frequently occurring tasks. The object is to obtain
theoretical performance limits, taking into account PE utilisation, routing
and branching. Obviously there is no point in pursuing an application
unless it offers substantial returns on that basis.


A fourth degrading factor has to be added to those listed in
subsection 1.1: the time taken to supply the ACU with instructions and the
time taken by the ACU in modifying address counters, testing for loop
termination, etc, in which potential array functions are 'lost'. There are
several ways of minimising the loss, including instruction buffering and
overlap, but I do not propose to discuss them here and shall assume instead
a moderate time of DPA execution (200nsec) in which allowance has been made
for the effect just mentioned. On that basis the subtraction example given
at the end of the first lecture requires:

3P + 10

DPA cycles, from which it can be seen that if the precision is large (say
greater than 20) the 'end effects' are negligible, but if it is small, as is
frequently the case, we should be looking for better ways of setting carry
and testing overflow. However, let me emphasise the importance of evaluating
any such improvement in terms of its contribution to overall system
throughput rather than to individual procedures.


The procedures are classified as 'arithmetic and logic', which are mainly
concerned with operations within a PE or row of PEs without regard to
neighbours; 'routing', which are concerned with preparing arrays for
parallel arithmetic; and 'matrix', which combine the first two. The objective
of a more complete study would be to provide a set of arithmetic and data
manipulative functions that can be used by application programmers and
compilers in generating array code and, perhaps more valuable in the long run,
to develop the intuitive understanding of the array which is essential to
successful systems analysis.
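
The weight of the end effects in the 3P + 10 estimate can be tabulated directly. The sketch below uses only the cycle count and the 200nsec cycle time quoted above; the function names are illustrative.

```python
DPA_CYCLE_NS = 200  # assumed DPA execution time per cycle, from the text


def subtract_cycles(p: int) -> int:
    """DPA cycles for the subtraction example at precision parameter P."""
    return 3 * p + 10


def end_effect_fraction(p: int) -> float:
    """Share of the cycles spent outside the 3P loop term."""
    return 10 / subtract_cycles(p)


for p in (4, 8, 16, 32):
    cycles = subtract_cycles(p)
    print(f"P={p:2d}  cycles={cycles:3d}  time={cycles * DPA_CYCLE_NS / 1000:5.1f} usec  "
          f"end effects={end_effect_fraction(p):.0%}")
```

At small P the fixed ten cycles dominate, which is the case the text singles out as needing a better way of setting carry and testing overflow.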
2.1  Arithmetic and logical operations

There are two word orientations of importance in DPA operations:
horizontal and vertical. The former corresponds to the conventional store
layout, so that data written as words by the ACU can be processed in situ
by the array. The latter form, which was used in the example of subtraction,
requires the data words to be stored in consecutive bit planes, one word to
each PE. The DPA4 can process 16 words of 16 bits in horizontal form, or
256 words one bit at a time vertically. The distinction is less important
in logical operations, the bit processing rate being the same in either case,
than for arithmetic, in which provision has to be made for carry propagation.
When dealing with large arrays of short words the vertical form is to be
preferred because it allows greater PE utilisation, and in certain special
functions such as the manipulation of Boolean arrays or sign digits it is
about M ($= 2^n$, the word length of the array) times faster than horizontal
and $M^2$ times faster than sequential processing in the ACU. (More precisely,
such comparisons should read that in the limit, for large arrays, the ratio
is $k \cdot M$ or $k \cdot M^2$ where k is a small constant factor, usually
near unity.)

In vertical mode, carry is propagated through the C register in
each PE. In horizontal mode, convention requires carries to propagate to
the west, which would be so time-consuming in the DPA that it would have
little practical use. We shall see later how additional routing and/or
function can be used to achieve competitive speeds in horizontal mode.

When summing a large number of word planes the DPA is placed at less of a
disadvantage by using carry-save techniques. For example, horizontal
multiplication in DPA4 requires the summation in each row of PEs of 16
expressions of the form:

$$\sum_{i=0}^{15} b_i^j \cdot 2^i \qquad \text{for } j = 0 \text{ to } 15$$

where b_i^j is the ith bit of the jth word, which occurs in the ith PE. The
product can be formed by summing vertically to give the non-standard result

$$P = \sum_{i=0}^{15} c_i \cdot 2^i \qquad \text{where} \qquad c_i = \sum_{j=0}^{15} b_i^j$$

Each c_i is a four-bit carry which can be propagated by KVW, followed by
summing again vertically. The final addition, which completes the carry
propagation, is probably best done in the ACU. (This is one of the
applications in which the end-effects on addition become important.)
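
The carry-save idea can be illustrated with a minimal Python sketch. It is
not DPA code: the function names are hypothetical, the partial-product words
are formed by shifting (an assumption about how the 16 words arise), and the
columns are widened to 32 so that no bits are lost. Each column count plays
the part of c_i, and the last step is the single carry-propagating addition
that the text suggests leaving to the ACU.

```python
def column_sums(words, width=16):
    """c[i] = number of words whose bit i is set (one vertical sum per column)."""
    return [sum((w >> i) & 1 for w in words) for i in range(width)]


def resolve(columns):
    """Final carry-propagating addition: P = sum of c[i] * 2**i."""
    return sum(c << i for i, c in enumerate(columns))


def multiply(x, y, width=16):
    # Partial-product word j is x shifted left by j when bit j of y is set.
    words = [(x << j) if (y >> j) & 1 else 0 for j in range(width)]
    # Columns are widened to 2*width so the shifted words fit (sketch assumption).
    return resolve(column_sums(words, 2 * width))
```

For instance, `multiply(12345, 54321)` agrees with the ordinary product, and
no carry moves between columns until `resolve` is called.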


The low level of coding allows advantage to be taken of special
properties of the data in many instances. Multiplication by a constant, for
example, is faster than array multiplication in all cases because it can
take advantage of strings of zeros or ones in the multiplier. Iterative
calculations such as square root can use a low precision approximation in
the early stages. Note also that in squaring operations the coefficient
b_i^j, which is in fact b_j.b_{i-j}, where the b_k are the digits of the
operand, also occurs as b_i^{i-j}. Therefore c_i can be computed as:

$$c_i = \begin{cases}
2\,(b_i b_0 + b_{i-1} b_1 + \dots + b_{i/2+1}\, b_{i/2-1}) + b_{i/2}\, b_{i/2} & (i \text{ even}) \\
2\,(b_i b_0 + b_{i-1} b_1 + \dots + b_{(i+1)/2}\, b_{(i-1)/2}) & (i \text{ odd})
\end{cases}$$

which halves the number of partial products.
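
A short Python check of the squaring shortcut follows. It is a sketch under
stated assumptions: b_k denotes bit k of the operand, column i collects every
product b_k.b_{i-k}, and the function names are hypothetical. Only the
products with k < i-k are formed; they are doubled, and the middle term is
added when i is even.

```python
def square_columns(x, width=16):
    """Column sums c[i] of x*x, forming each cross product only once."""
    b = [(x >> k) & 1 for k in range(width)]

    def bit(k):
        return b[k] if 0 <= k < width else 0

    columns = []
    for i in range(2 * width - 1):
        # Pairs (k, i-k) with k < i-k; each stands for two equal partial products.
        pairs = sum(bit(i - k) * bit(k) for k in range((i + 1) // 2))
        c = 2 * pairs
        if i % 2 == 0:
            c += bit(i // 2) * bit(i // 2)   # the unpaired middle term
        columns.append(c)
    return columns


def square(x, width=16):
    return sum(c << i for i, c in enumerate(square_columns(x, width)))
```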


In general, vertical multiplication of two p-bit numbers requires
p additions to give a 2p-bit result, or 3p^2 basic cycles. A p-bit result
requires 3p^2/2 basic cycles but slightly more organisation. Division, using
a restoring algorithm, produces a p-bit quotient from a 2p-bit dividend and
p-bit divisor in 6p^2 basic cycles.

In floating point addition and subtraction the time taken to compare
and align operands outweighs the arithmetic by a considerable margin.
A scaling operation takes two cycles per bit in vertical mode. Thus, using
radix 16 exponent and 24-bit mantissa, three normalising shifts are required
before and after the add/subtract, and the equivalent of 5 moves to and
from workspace, giving 25 basic cycles per bit as opposed to three for fixed
point. There is clearly a great advantage in space and time if fixed point
arrays of low precision can be used.

The following table summarises the theoretical limits on vertical
operations. Practical measures will be given in the next lecture.

TABLE 1: THEORETICAL BOUNDS ON ARITHMETIC SPEEDS
All operands in vertical form (entries in basic cycles)

                 FIXED POINT      FLOATING POINT

ADD/SUBTRACT     3p               11f + 6e + 4df
MULTIPLY         3p^2/2           3f^2/2 + 3e + 2f
DIVIDE           6p^2
MOVE             pc               pc
SCALE            2p               2p

Where:

    p is the number of bits in the operand
    f is the number of bits in the fraction
    e is the number of bits in the exponent
    d is the number of denormalising shifts
    c is the distance moved in rows + columns
2.2  Data routing

The figures of Table 1 indicate that numerical procedures will
be dominated by multiply/divide and floating point add/subtract times: each
such operation requires at least 500 DPA cycles, or 100µsec on the
assumption of a 200nsec effective execution time. The most efficient use of the
array will be achieved in two stages: problem analysis, which seeks to
minimise the arithmetic content and external I-O (which will be examined
later); and detailed storage mapping aimed at maximum PE utilisation.
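
The bounds can be turned into rough timings with a small Python sketch. The
formulas are those of Table 1 as read here; in particular the floating point
add/subtract entry is taken to be 11f + 6e + 4df, which is an assumption
about a poorly printed entry, and the function names are hypothetical.

```python
CYCLE_NSEC = 200  # effective execution time assumed in the text


def fixed_point(p):
    """Basic cycles for p-bit fixed point operands in vertical form."""
    return {"add": 3 * p, "multiply": 3 * p * p // 2,
            "divide": 6 * p * p, "scale": 2 * p}


def floating_point(f, e, d):
    """Basic cycles for an f-bit fraction, e-bit exponent, d denormalising shifts."""
    return {"add": 11 * f + 6 * e + 4 * d * f,
            "multiply": 3 * f * f // 2 + 3 * e + 2 * f}


def microseconds(cycles):
    return cycles * CYCLE_NSEC / 1000
```

With f = 24, e = 8 and d = 3 the add/subtract figure is 600 cycles, that is
25 cycles per fraction bit, which matches the estimate given earlier for a
radix 16 exponent and 24-bit mantissa.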
Data routing functions are used to move from one store map to
another. Although routing is complex at times, the intuitive feeling that it
will dominate execution time quite often turns out to be incorrect. In weather
forecasting, for example, using spherical mapping of the type described
in the first lecture, the routing overhead is estimated to be about 4% of
the total execution time. When making comparison with sequential machines
it must also be remembered that they too sustain a significant amount of
routing overhead in the shape of register load and store, shift and copy
instructions. It is important to compare MOPS, ie (millions of) useful
arithmetic operations per second, rather than MIPS, ie instructions executed
regardless of whether they do anything useful to the outside observer.

Movement within a bit plane and within local storage use distinct
mechanisms, so they will be examined separately. For horizontal data,
remapping involves movement in the north-south direction and relocation in
PE stores, while east-west movement is used for scaling. For vertical data,
scaling is effected by relocation in PE stores and remapping involves both
east-west and north-south shifts. In converting from horizontal to vertical
form (and vice-versa) a rotation procedure is used:

REGS [ horiz vert temp ]

DO (temp ORC horiz; vert IRC/U temp; vert'N; horiz'1) WHILE VA

which is repeated for each p-bit N-vector. Thus mode conversion takes about
the same time as multiplication.
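
The effect of the conversion, though not its mechanism, can be sketched in
Python. The function names are hypothetical and a list of integers stands in
for the store; the point is only that horizontal words and vertical bit
planes are transposes of one another.

```python
def horizontal_to_vertical(words, p):
    """Return p bit planes; plane k holds bit k of every word, one bit per PE."""
    return [[(w >> k) & 1 for w in words] for k in range(p)]


def vertical_to_horizontal(planes):
    """Rebuild the words from their bit planes (inverse of the function above)."""
    count = len(planes[0])
    return [sum(planes[k][i] << k for k in range(len(planes)))
            for i in range(count)]
```

Each output plane needs one pass over every word, which is consistent with
the remark that mode conversion costs about a multiply time.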


Data movement requires one DPA operation for each row or column
traversed within the plane. Two extreme examples which are often used as
benchmarks are uniform shift or rotation in the plane and arbitrary
permutation of elements, in each case regarding the DPA as a linear array of N
cells.


Using cylindrical geometry the average number of operations for
a uniform shift (assuming all equally likely) is N/2. In practice, it often
appears that not all shifts are equally likely: near-neighbour connections
in the plane predominate, with power-of-two shifts occurring quite often.
The four near-neighbours are accessed in one cycle, and the eight
near-neighbours in 1.5 cycles on average. It is easy to see that in power-of-two
shifts the average number of operations is N/n, ie N/log2 N. In each case
the operations must be repeated p times for p-bit words. If the data array
is larger than the DPA it is necessary to use plane geometry, making the
edge connections through ACU registers, resulting in three DPA cycles per
row or column traversed per plane and considerably greater overhead in
control. There is no significant advantage from using data arrays that
are smaller than the DPA.
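
These averages are easy to confirm. The Python sketch below assumes shifts
of 0 to N-1 places in one direction round the cylinder, power-of-two
distances 1, 2, ..., N/2, and one cycle per row or column step; the function
names are hypothetical. The exact means are (N-1)/2 and (N-1)/n, which the
text rounds to N/2 and N/n.

```python
from fractions import Fraction


def mean_uniform_shift(n_cells):
    """Mean moves for a shift of 0..N-1 places, all equally likely."""
    return Fraction(sum(range(n_cells)), n_cells)


def mean_power_of_two_shift(n_cells):
    """Mean moves over the shift distances 1, 2, 4, ..., N/2 (N = 2**n)."""
    n = n_cells.bit_length() - 1
    return Fraction(sum(2 ** k for k in range(n)), n)


def mean_eight_neighbour_cycles():
    """Mean cycles to reach one of the eight neighbours, one cycle per step."""
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    return Fraction(sum(abs(dr) + abs(dc) for dr, dc in offsets), len(offsets))
```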

Any permutation of elements can be represented as a sorting
problem by attaching a key to each giving its (unique) destination in the
final listing. At first sight sorting is unattractive for an array
processor because it implies an irregular routing of elements. In a
sequential machine the number of comparisons B(t) required to sort t items
by binary insertion [6] is t*log2 t - t + 1 when t is a power of 2, eg
B(16)=49, B(32)=129. A 'minimum delay' parallel sort is shown in Figure 5
for t = 16, in which each horizontal line represents an item in the list
and each vertical line joins two items to be compared, followed by an
exchange if the lower element (in the diagram) is lower in value. The
resulting list is in ascending order from top to bottom. It can be seen
that in moving from left to right only one or two pairs are being compared
and that if 16 processors were available several successive stages could
be overlapped. In the example, only 10 distinct stages or delays are used.
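
The figures quoted for B(t) can be reproduced with a few lines of Python,
on the assumption that inserting the kth item into k-1 sorted items costs
ceil(log2 k) comparisons in the worst case. The function names are
hypothetical.

```python
def binary_insertion_comparisons(t):
    """Worst-case comparisons to sort t items by binary insertion."""
    # (k - 1).bit_length() equals ceil(log2 k) for k >= 1.
    return sum((k - 1).bit_length() for k in range(1, t + 1))


def closed_form(t):
    """t*log2(t) - t + 1, valid when t is a power of 2."""
    n = t.bit_length() - 1
    return t * n - t + 1
```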

The DPA does not have direct routing across several PEs as the
minimum delay sort assumes. An alternative is the odd-even exchange, which
requires only neighbours to be compared. In the example, the number of
stages is 16, which generalises to t for sorting t items. In fact t is an
upper limit because the sort is complete whenever a full round of
comparisons is not followed by any exchange. It is possible that the total
number of exchanges required in practice could be reduced by occasionally
sorting in the orthogonal direction, by analogy with Shell sorting: if the
N^2 elements are ordered by linear connection in the east-west direction that
would imply sorting north-south in order to accelerate progress towards the
final positions. I do not know of any practical or theoretical studies of
such techniques.
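
A sequential Python sketch of the odd-even exchange follows. On an array all
the comparisons of one stage would take place together; here the inner loop
stands in for them. The early exit after one even and one odd stage without
an exchange is this sketch's own reading of the termination rule, and the
function name is hypothetical.

```python
def odd_even_exchange_sort(items):
    """Sort by neighbour exchanges; return the sorted list and the stages used."""
    a = list(items)
    t = len(a)
    stages = 0
    quiet = 0  # consecutive stages that made no exchange
    for stage in range(t):
        exchanged = False
        # Even stages compare pairs (0,1), (2,3), ...; odd stages (1,2), (3,4), ...
        for i in range(stage % 2, t - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                exchanged = True
        stages += 1
        quiet = 0 if exchanged else quiet + 1
        if quiet == 2:
            break
    return a, stages
```

An already ordered list stops after two stages, while a list in reverse
order needs close to the full t.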

Applying the above results to the DPA, we see that N^2 items of
p bits can be sorted in vertical mode in N^2(3p_k + 3p) cycles, where p_k is
the precision of the key (comparison takes 3 cycles and exchange 3 cycles
per bit). Straightforward insertion, in which a new list is built up in the
required order by adding elements one at a time, requires a comparison and
move at each step, giving N^2(3p_k + 2p) cycles, though it requires more space
and does not offer the prospect of early termination. Larger arrays can
be handled by storing adjacent elements in the same local store, but the
exchange then takes 4 cycles and the comparison 2 per bit. Both methods
are significantly better than binary insertion, which is dominated by the
time to move and insert the data items rather than the actual comparisons.
The same techniques apply in horizontal mode, though once again the DPA is
at a disadvantage without fast carry propagation.

An additional wired interconnection pattern known as a 'shuffle'
has been proposed to assist routing operations [6,p237], [7] and [8]. The
shuffle effects a permutation in which the destination of any element is
defined by cyclically shifting its current address one position to the left.
It has been shown that any uniform shift of N^2 elements can be realised in
a multiple of log2 N shuffle-exchange steps, and that an arbitrary permutation
can be achieved in time proportional to N. The individual steps are more
complex than those outlined above: for DPA6 the average shift requires 32
moves, ie 32 machine cycles, or 12 shuffle-exchanges, each requiring 4
cycles. The practical benefit in terms of the shifts and permutations most
frequently encountered remains an open question. The relevance to Fast
Fourier Transform is examined in the next lecture in connection with the
DAP.
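
The shuffle permutation itself is simple to state in Python. The sketch
assumes a power-of-two number of elements addressed 0 upwards; the function
names are hypothetical.

```python
def shuffle_destination(address, bits):
    """Cyclic left shift of a `bits`-wide address by one position."""
    top = (address >> (bits - 1)) & 1
    return ((address << 1) & ((1 << bits) - 1)) | top


def shuffle(items):
    """Move every element to the address given by shuffle_destination."""
    bits = (len(items) - 1).bit_length()   # len(items) must be a power of two
    out = [None] * len(items)
    for source, value in enumerate(items):
        out[shuffle_destination(source, bits)] = value
    return out
```

For eight elements the result interleaves the two halves of the list, and
three successive shuffles (one per address bit) restore the original order.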

Evidently there are many applications of DPA to sorting both
large and small data sets and as for sequential machines the eventual
choice of algorithm depends on the characteristics of the data and the way
it is used. Before leaving this topic it is as well to recall that the
need for sorting must be reviewed at the systems analysis level. It has
been stressed in the past because of the limitations of sequential search
methods, but given that a DPA can search N^2 items in parallel there may be
no point in retaining data sets of less than N^2 items in sorted form: they
can be accessed in any desired order. The last comment is particularly
relevant where there are multiple keys and the retrieval criterion is a
logical or arithmetic function of the keys. A DPA6 would carry out useful
searching operations at a rate exceeding 10^10 bits/second, which is probably
one of its most cost-effective application areas.


2.3  Matrix operations

Matrices are stored in either horizontal or vertical mode: DPA4
can process a d*16 matrix in horizontal form (or d*32 if the elements are
bytes), the local store providing the second dimension d; it can process a
16*16 matrix in vertical form, of any precision up to 128 bits. Larger
matrices can be handled by partitioning, but as the resulting algorithm is
often expressed in terms of operations on d*16 or 16*16 matrices that is
usually best done by creating structures of three or more dimensions, the
local store providing the third and higher dimensions. A DPA4.1 would
contain up to 32 matrices of 16*16 elements in single precision, 32-bit
form.


Although the processing rate in either mode is theoretically
about the same (with suitable arrangements for carry), vertical mode offers
variable precision and indexing flexibility that does not exist for
horizontal. One of the disadvantages of single bit PEs in comparison with
machines such as Illiac IV is that it is uneconomical to provide local
store indexing, but that can be overcome as explained below by using
projection operations. In general, the horizontal form is attractive in DPAn
for 'vectorial' problems of fairly low precision (up to 2^n) or where frequent
word access by the ACU is implied.


For the remainder of this subsection we assume that vertical
data is used, and unless otherwise specified a matrix is taken to have
indices running from 0 to N-1 (= 2^n - 1), the coordinate axes being i
(north-south) and j (west-east). The precision, or the number of bits in each
element, is given by the limit field of the matrix address, plus 1. If
the limit is zero there is only one bit and the matrix is said to be
'boolean'.


One method of sorting not touched on in the previous subsection
is by selecting a maximum (or minimum) element, which is
eliminated by masking, then the next largest, and so on. The following
program selects the largest positive element in a fixed point matrix M,
masked by a boolean matrix MASK.

REGS [ M MASK temp p ]

/* Find the precision p from the limit of M and set the
   activity bits from the mask, eliminating negative values */
p = LIMIT(M); LDA MASK; LDB MASK; AND M'p; EQU -MASK;
STB MASK; temp AOR -MASK; temp % -1; IF ZE RETURN; LDA MASK; p-1;

/* Now the A registers contain the reduced mask of elements to
   be scanned. In the next loop, the mask is 'anded' with successive
   bit planes in M */
WHILE GE DO (LDB M'p; STB MASK; temp AOR -MASK; temp % -1;
    IF NZ LDA MASK; p-1); RETURN

On return, MASK indicates by 1's the position of the maximum elements, if
any. The number of DPA cycles is 4p+7 for each selection. Adding the time
taken to digitise elements or extract them, the complete sort takes about
the same time as those mentioned earlier. In many applications, however,
only the maximum items are of interest.
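
The selection can be paraphrased in Python. The sketch assumes non-negative
values held as integers and a mask of 0's and 1's, one per element; the
function name is hypothetical and no attempt is made to model the ACU
registers. The candidate mask is 'anded' with each bit plane in turn,
starting from the most significant, and the reduced mask is kept only when
it is not empty.

```python
def select_maximum(values, mask, precision):
    """Return a mask marking the largest masked value(s)."""
    active = list(mask)
    for bit in range(precision - 1, -1, -1):        # most significant plane first
        plane = [(v >> bit) & 1 for v in values]
        reduced = [a & b for a, b in zip(active, plane)]
        if any(reduced):                            # some candidate has a 1 here
            active = reduced
    return active
```

Ties are preserved, so the returned mask marks every element equal to the
maximum; an empty mask is returned unchanged.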


Projection operations are used to distribute data along row or
column lines. The source may be a scalar value in the ACU or a vector
selected from store. For example, we can define a procedure ROWP(x,y)
that will project x if it is numeric into every element of the matrix y,
and if x is a matrix it will extract a vector by column selection (the
low order address bits) and project it by row to y. The matrix
multiplication Z := X*Y takes the form:

REGS [ X Y TX TY Z N ]

ROWP(0, Z); DO (ROWP(X, TX); COLP(Y, TY);
    MULT(TX, TY); ADD(Z, TX); X'1; Y'1; N-1)
WHILE GE;

where COLP is defined similarly to ROWP.
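
The same calculation can be written out in Python to show what the
projections achieve. It is a sketch with hypothetical function names, and it
assumes that step k projects column k of X along the rows and row k of Y
down the columns; the element-by-element multiply and add inside the loop
would be a single parallel MULT and ADD on the array.

```python
def row_project(x, k):
    """Broadcast column k of x along every row: result[i][j] = x[i][k]."""
    n = len(x)
    return [[x[i][k]] * n for i in range(n)]


def col_project(y, k):
    """Broadcast row k of y down every column: result[i][j] = y[k][j]."""
    n = len(y)
    return [list(y[k]) for _ in range(n)]


def matmul_by_projection(x, y):
    n = len(x)
    z = [[0] * n for _ in range(n)]
    for k in range(n):
        tx, ty = row_project(x, k), col_project(y, k)
        for i in range(n):
            for j in range(n):
                z[i][j] += tx[i][j] * ty[i][j]
    return z
```

N steps, each a projection pair followed by one multiply and one add, give
the full N by N product.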

More generally, projection can be based on a vector selected by a
boolean matrix which specifies by 1's a single-valued boundary to be
used in defining a vector. The following statement projects a single
bit selected by MASK from the matrix M into TM by row:

REGS [ M MASK temp /* a workplane */ t TM ]

LDB -MASK; OP M; STB/U temp; t AOR temp; TM IR t

There are four versions of the code since the ACU allows selection of the
control vector by row or column and, independently, projection by row or
column.

[Figure: an example of a selection mask.]
Sets of linear equations of the form Mx = Y may be solved by
inverting M, for example by the method of Gauss-Jordan elimination given
later (p.43) for the DAP, and then premultiplying Y by M^-1.

A number of techniques particularly suited to parallel operation
have been developed for the solution of tridiagonal sets of equations. Here
the matrix M takes the form:

$$M = \begin{pmatrix}
d_1 & f_1 &        &        &     \\
e_2 & d_2 & f_2    &        &     \\
    & e_3 & d_3    & f_3    &     \\
    &     & \ddots & \ddots & \ddots \\
    &     &        & e_m    & d_m
\end{pmatrix}$$

with zeros off the diagonals. In DPAn it is possible to store 2^(2n) sets of
coefficients in vertical form, though the method of reduction ideally
requires m = 2^(2n) - 1, so that DPA4 would handle 255 equations, represented by
four matrices E, D, F, and Y.

The method of cyclic odd-even reduction eliminates the unknowns
of odd index by combining equations, yielding a new tridiagonal
system of size (m-1)/2. The process is repeated until a single equation is
found, which is then solved and the remaining unknowns found by
back-substitution.

The numerical algorithm consists of eliminating the coefficient
of x_{i-1} in each even numbered equation i by linear combination with equation
i-1. Six multiplications and two additions are required, but because a pair
of PEs is involved only three multiply and one addition times are required.
The coefficient of x_{i+1} is then eliminated using equation i+1, which again
requires three multiply and one addition. At each stage the number of
elements is halved, therefore the data routing increases, but it can be
seen that in the elimination process there are two sets of power-of-two
shifts, which are repeated in back-substitution, requiring 4N moves. In
back-substitution an equation of the form:

e.x_{i-1} + d.x_i + f.x_{i+1} = y

has to be solved for x_i, requiring two multiplications, two additions and
one division, which can again be compressed by using adjacent PEs for
multiplication.
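
A sequential Python sketch of the reduction and back-substitution is given
below. It is an illustration under stated assumptions rather than a DPA
program: lists are indexed from 0, so the even numbered equations of the
text sit at odd list positions; the number of equations is 2^k - 1; e[0] and
f[-1] are zero; and exact fractions are used so that the result can be
checked. The function name is hypothetical.

```python
from fractions import Fraction


def cyclic_reduction(e, d, f, y):
    """Solve e[i]*x[i-1] + d[i]*x[i] + f[i]*x[i+1] = y[i] for all i."""
    m = len(d)
    if m == 1:
        return [y[0] / d[0]]
    re, rd, rf, ry = [], [], [], []
    for i in range(1, m, 2):
        a = e[i] / d[i - 1]            # removes x[i-1] using equation i-1
        b = f[i] / d[i + 1]            # removes x[i+1] using equation i+1
        re.append(-a * e[i - 1])
        rd.append(d[i] - a * f[i - 1] - b * e[i + 1])
        rf.append(-b * f[i + 1])
        ry.append(y[i] - a * y[i - 1] - b * y[i + 1])
    x = [None] * m
    x[1::2] = cyclic_reduction(re, rd, rf, ry)   # reduced system of size (m-1)/2
    for i in range(0, m, 2):                     # back-substitution
        left = x[i - 1] if i > 0 else 0
        right = x[i + 1] if i < m - 1 else 0
        x[i] = (y[i] - e[i] * left - f[i] * right) / d[i]
    return x
```

All the equations treated inside one loop are independent of one another,
which is what allows each stage to be carried out in parallel on the array.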


To form an idea of the relative magnitudes of arithmetic and
routing, we may take the mean time of arithmetic operations to be 200µsec
giving 2.4msec per stage, or for N=16, ie for 255 equations, eight stages
or 20msec. The number of moves is about 100, which requires under 1msec
for 32-bit operands, ie about 5% of the computation time.


The reader will be able to suggest several ways of speeding up
the procedure: in later stages of calculation it is possible to increase
parallelism by spreading the reduction over more PEs; if several sets of
equations are being solved the reduction of one set may be partly overlapped
with the back-substitution of the preceding set; and finally the numerical
algorithm may converge before completing the reduction, in the sense that the
off-diagonal terms are all less than a preset value. The solution of
tridiagonal equations illustrates very clearly the way in which numerical, data
manipulation and programming skills can be combined to make efficient use
of a DPA.
3 EXPERIMENTAL ARRAYS


We now leave theory to examine three recent examples of arrays
of PEs with single-bit data paths: STARAN, CLIP and DAP. Although they
share the same engineering technique the position of the array within the
system and the organisation of software to support parallelism are quite
different in each case. The first two are primarily intended for specific
problem areas, namely aircraft tracking (STARAN) and image processing
(CLIP), but they have many features of general applicability that I shall
use to illustrate alternative design approaches. The reader is referred
to the published papers for further information.


3.1  STARAN [10]

The STARAN associative processor can be viewed as a control
memory shared by three processors: a PDP-11 host, an array control unit,
and an array I-O controller. The function of the host is to handle
external communications and to load array programs into the control memory,
part of which is fast (150nsec), the remainder slow (1µsec).

Instructions taken from the control memory by the ACU are
broadcast to a linear array of some multiple of 256 elements (in the Rome
Air Development Center configuration there are 1024 PEs). Each PE has
three single-bit registers and 256 bits of local store. A feature of
STARAN is the wide variety of connections that can be made between local
stores and PE registers, but first its operation will be described assuming
simple local store addressing as for the DPA.

The PE registers are designated X, Y and M. The input (f) to
the ALU is one of X, Y, M, a bit from the local store (m) or a data bit (d)
broadcast from the ACU. In arithmetic instructions a function φ is applied
to either or both of two pairs of arguments (X,f) and (Y,f), where φ is one
of the sixteen boolean functions of two single-bit arguments. The result
of φ(Y,f) overwrites Y. The result of φ(X,f) overwrites X either
unconditionally or conditioned by the original value of Y, ie if Y=1, X is
overwritten, else X is unchanged.

The above instructions can be written in the form:

f φ g h

where g is the store option on X, ie "X" meaning unconditional write or
"X/Y" meaning write conditioned by Y; and h is the store option on Y, ie
"Y". Absence of g or h implies that no store takes place. Other array
functions are provided to load M and to store Y either unconditionally or
masked by M, ie m becomes (Y.M + m.M'), where M' is the complement of M.
The following example of vertical addition is taken from [10].

The problem is to form the one-bit sum B = A + B.

/* Initially X=0 and Y is set to the carry-in */
1:  A XOR X/Y Y
/* Now X = A.CARRY and Y = A XOR CARRY */
2:  B XOR X/Y Y
/* Now X contains the carry, Y contains the sum */
3:  Y XOR X        B=Y
4:  X XOR X Y
/* Now X and Y are ready to process the next bit */

Thus serial addition takes 4 cycles, or 500nsec per bit, not counting the
ACU overheads.
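
The four steps can be traced for a single PE with a short Python sketch.
It follows the instruction form described above, with the function applied
to (X,f) and (Y,f) and X written either unconditionally or only where the old
Y is 1. The helper names are hypothetical, and the point at which the sum bit
is written back to store (during step 3) is this sketch's reading of the
listing.

```python
def op(x, y, f, fn, store_x=None, store_y=False):
    """One array instruction. store_x is None, "X" or "X/Y" (write X where old Y is 1)."""
    new_x, new_y = x, y
    if store_x == "X" or (store_x == "X/Y" and y == 1):
        new_x = fn(x, f)
    if store_y:
        new_y = fn(y, f)
    return new_x, new_y


def xor(a, b):
    return a ^ b


def add_bit(a, b, carry_in):
    """Return (sum bit, carry-out left in Y, final X) for one bit position."""
    x, y = 0, carry_in
    x, y = op(x, y, a, xor, "X/Y", True)   # 1: X = A.CARRY, Y = A xor CARRY
    x, y = op(x, y, b, xor, "X/Y", True)   # 2: X = carry-out, Y = sum
    total = y                              #    sum bit written back to store
    x, y = op(x, y, y, xor, "X", False)    # 3: X = carry-out xor sum
    x, y = op(x, y, x, xor, "X", True)     # 4: X = 0, Y = carry-out
    return total, y, x
```

Running all eight combinations of A, B and carry-in confirms that Y holds
the carry and X is cleared, ready for the next bit.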

As already noted, local stores are not directly connected to the
PEs. Starting with an array of 256 stores of 256 bits, the stored pattern
is skewed through 45° as shown in the diagram for a 4 by 4 array. The
advantage gained is that both rows and columns of the original array can be
accessed as words by suitable indexing of each column and some potentially
useful 'hybrid' combinations of rows and column can be implemented. In DPA
terms, if we think of the data as stored in horizontal form it can be
processed serially by bit (256 words at a time), serially by byte (32 bytes at
one time), and so on.


[Diagram: a 4 by 4 array in ORIGINAL and SKEWED form, with the accesses for
the second column, the third row and a serial-parallel combination.]
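
The benefit of skewing can be seen in a small Python model. The mapping used
here, in which element (r, c) is held in store (r + c) mod n at address r, is
an assumption chosen for illustration; the actual STARAN arrangement may
differ in detail. The function names are hypothetical. Every row and every
column touches each store exactly once, so either can be fetched as a single
word, and the word then has to be rotated to restore the original sequence.

```python
def skew(array):
    """Store element (r, c) in stores[(r + c) % n] at address r."""
    n = len(array)
    stores = [[None] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            stores[(r + c) % n][r] = array[r][c]
    return stores


def read_row(stores, r):
    n = len(stores)
    word = [stores[s][r] for s in range(n)]             # same address in every store
    return [word[(r + c) % n] for c in range(n)]        # undo the rotation


def read_column(stores, c):
    n = len(stores)
    word = [stores[s][(s - c) % n] for s in range(n)]   # a different address per store
    return [word[(r + c) % n] for r in range(n)]        # undo the rotation
```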


It will be seen that the words retrieved generally need permuting
to appear in the sequence of the original array, and that is done in a
separate 'flip' network. The flip network is a permuting device that takes
data words read from the local store array or from the words of 256 X, Y or
M bits in the PEs and carries out a rotation or reversal on each segment of
bits in the input word. A segment is selected by program. It is a power
of two (up to 256 bits) in length. The output provides the input f to each
PE in the array instructions. Thus, in the example given above, to bring
the second column into correspondence with the original form we would take
segments of two bits each and reverse them, (11 01) becoming (01 11) etc.


It is difficult to see application for more than a few of the
permutations permitted by STARAN. The skewed form of store is clearly
helpful in allowing a choice between vertical and horizontal processing
without the need to rotate data in the store which, as we noted for the DPA,
takes about a multiply time. In a machine with very much faster arithmetic,
such as Illiac IV, the ability to skew data assumes greater importance. The
power-of-two shifts applied by the flip network are useful in many
applications; on the other hand they are of less importance than arithmetic, and
it could be argued that faster operation and more local store would be a
better investment for general purpose array work.


An extra facility that is valuable in search procedures is the
detection in any array of 256 PEs of the index of the first non-zero Y bit,
if any. It is in that sense that the array can be labelled 'associative'.
Without the additional hardware, eight mask and compare operations would be
needed to develop the digits of the index.
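
The eight mask and compare operations amount to a binary search on the
index, developing one digit per step from the most significant end. A Python
sketch, with a hypothetical function name and a list of 0's and 1's standing
in for the Y bits:

```python
def first_nonzero_index(y_bits):
    """Return (index of the first 1, mask tests used), or (None, 0) if all zero."""
    if not any(y_bits):
        return None, 0
    k = len(y_bits).bit_length() - 1      # len(y_bits) must be 2**k
    index, steps = 0, 0
    for bit in range(k - 1, -1, -1):
        # Mask the lower half of the block still under consideration.
        lower = range(index, index + (1 << bit))
        if not any(y_bits[i] for i in lower):
            index += 1 << bit             # the first 1 lies in the upper half
        steps += 1
    return index, steps
```

For 256 PEs the loop runs eight times, once for each digit of the index.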
3.2  CLIP [11]

The processing of digitised pictures introduces a class of
problems not considered so far. If an image is represented by a grid of black
and white dots then the recognition of the boundary of a two-dimensional
object involves much more subtle neighbour interaction than has so far been
discussed. Picture processing machines can be thought of as distributed
processor arrays with a well developed means of propagating signals across
the array. For example, to detect a closed boundary marked by 1's we
could 'flood' the array with 1's input at the edges and allow the signal
to spread until a boundary is reached, then stop. The boundary points can
then be marked by 'anding' the original image with the occupied cells.
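
The flooding procedure can be imitated in Python. The sketch assumes four
neighbour connection, an edge input of 1 on every side, and a rule in which
background cells pass the signal on while image cells receive it without
forwarding it; the function name is hypothetical. Each pass of the loop
corresponds to one parallel step, and the loop ends when nothing changes.

```python
def flood_boundary(image):
    """Mark the image cells reached by a signal flooding in from the edges."""
    rows, cols = len(image), len(image[0])

    def reached(signal, r, c):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                return 1                       # edge input is 1
            if signal[nr][nc] and not image[nr][nc]:
                return 1                       # a flooded background neighbour
        return 0

    signal = [[0] * cols for _ in range(rows)]
    while True:
        new = [[reached(signal, r, c) for c in range(cols)] for r in range(rows)]
        if new == signal:
            break
        signal = new
    # 'And' the original image with the occupied cells.
    return [[image[r][c] & signal[r][c] for c in range(cols)] for r in range(rows)]
```

For a solid 3 by 3 block in a 5 by 5 field the result is the ring of eight
outer cells; the centre cell is never reached.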

The local PE connections assumed are usually 6 or 8 neighbour,
with programmed selection of the rule of signal propagation. In a
rectangular DPA any interconnection pattern can be programmed with the help of
explicit move instructions, whereas in a picture processor the signal is
allowed to 'ripple' through the PEs (by analogy with carry propagation in
horizontal mode) in a single instruction of variable duration. Each picture
element (pixel) is mapped into the local store of one of the PEs: DPA4
would represent in one bit plane a 16*16 black and white image, or d*256
if the pattern is stored vertically. Larger pictures, grey code or colour
images would naturally require more storage. In addition, working storage
is needed in each PE to contain derived patterns representing boundaries,
internal regions, etc.

In a typical image transformation a single bit in each element
is designated the 'output'. It is formed according to the current state
of the element, ie PE register values, and inputs received from selected
neighbours. For example, the transformation rule written as:

            1 0 0
    s_i     0 . 0     ->  1, s_j
            0 0 0

can be read 'if in state s_i and the input from the NW neighbour is 1 and
all others are zero, output 1 and go to state s_j'. The transformation is
applied in parallel to all picture elements and repeated until there is no
change in the output pattern for the entire image (one way to ensure
termination is to allow only 0->1 changes in the output). For example,
the above rule would propagate a diagonal line of 1's from any 1 input on
the north or west edge, until a cell not in state s_i, or in a 3 by 3 region
containing a 1 apart from the NW corner, is encountered.

The CLIP array described in [11] consists of 16*12 PEs
specialised to the type of transformation just described. Each PE has two
single bit working registers A and B, an output N, and 16 bits of local
storage D. There are three types of array instruction:

LOAD:     Initialise A and B, using the local store or zero as input.
          The geometric pattern (square or hexagonal) is also specified.

PROCESS:  Apply a transformation rule using the inputs N until there is
          no change in N throughout the array. Any of the neighbour
          connections can be selected and summed, then compared with
          a threshold value t. The PE input T is set to 1 if the sum
          exceeds the threshold, else zero. The value of N is φ(BvT,A)
          where φ is one of the 16 boolean functions of two variables.
          The PROCESS instruction also selects the edge inputs.

STORE:    A boolean function φ'(BvT,A) is evaluated and the result
          written to a local store plane or combined with the current
          value of D_i by 'and' or 'or' operation.


Figure 6: PE schematic for CLIP-3
In addition to the above instructions the ACU, which is controlled by a
separate 256*24-bit memory, can execute subroutine calls using a 16 word
link stack, branch or branch conditional on the AND of N outputs taken
over all 192 picture elements. Provision is also made to display the A and
B planes on a CRT so that the effect of different algorithms can be
observed experimentally.

The following example is taken from [11]. Given an image containing
biological cell patterns, it is required to select the outlines of
all cells containing nuclei. Hexagonal connection is assumed, with all six
inputs active and threshold zero. In the following symbolic program IMAGE,
OUTPUT, etc, refer to bit planes in the local store and the notation is
chosen to give the flavour of the calculation rather than detailed
instruction formats. Figure 7 shows the working results obtained after each
STORE instruction.

/* Form in OUTPUT the outer edges of objects in IMAGE */

 1: LOAD A=IMAGE; B=0
 2: PROCESS N=(B∨T)∧A; Edge input = 1
 3: STORE OUTPUT=(B∨T)∧A

/* Form in GROUND the background surrounding IMAGE objects */

 4: LOAD A=IMAGE; B=0
 5: PROCESS N=(B∨T)∧A; Edge input = 1
 6: STORE GROUND=(B∨T)∧A

/* Form in NUCLEI the cell nuclei. Propagation starts from the outer
   edge in OUTPUT, through 1-valued cells in IMAGE */

 7: LOAD A=IMAGE; B=OUTPUT
 8: PROCESS N=(B∨T)∧A
 9: STORE NUCLEI=(B∨T)∧A

/* Form in OUTPUT the masks of cells with a nucleus */

10: LOAD A=GROUND; B=NUCLEI
11: PROCESS N=(B∨T)∧A
12: STORE OUTPUT=(B∨T)∧A

/* Form in RESULT the nucleated objects */

13: LOAD A=OUTPUT; B=IMAGE
14: STORE RESULT=(B∨T)∧A


Figure 7: Working results after each STORE instruction (surviving panel
labels: 6: GROUND, 9: NUCLEI)
To replicate the CLIP instructions using the DPA would clearly
be very time-consuming (at a rough estimate 100 DPA cycles would be needed
for each evaluation of N), but in many applications the full generality of
selection and thresholding is not required. In dealing with grey scale or
hue the proportion of arithmetic operations will increase, while in many
operations such as thinning, smoothing, edge detection or gap filling there
is very little signal propagation in the sense of the above example. An
analysis of algorithms in terms of the frequency of PROCESS instructions
and the average length of signal path would be helpful in the design of
special purpose versions of the DPA.

Having refined an image and separated the distinct 'objects'
it is required to classify them in some way. In simple problems the
area, centre of gravity or moment of inertia may give enough information
for classification. Other problems will be handled by representing the
image as a graph which can be transformed in the ACU by list processing
techniques. It can be seen that there is no simple analog in the DPA to
addressing through a linked list. On the other hand, analysis of problems
at a 'higher' level frequently uncovers new ways of using parallelism. One
of the advantages of the DPA organisation is that sequential and parallel
operations can be applied without restriction to the same data sets.


3.3  DAP  [12]  [13] 


The  reader  will  recognise  the  ICL  Distributed  Array  Processor 
as  the  experimental  model  from  which  I have  extracted  the  principles  of 
the  DPA.  It  differs  from  DPA5.2  in  details  of  PE  and  ACU  design. 

In the PE (Figure 8) the four near-neighbour shifts are incorporated
into the arithmetic functions, so that it is possible to take an
operand from any of five PEs. The destination is local and controlled by
the activity register (A) as for the DPA. The A register can be used in
its own right for general logical operations, in particular for combining
boolean activity matrices. Data movement is carried out between PE
registers rather than stores. Provision is made for ripple carry
propagation in the east-west direction.

The resulting arithmetic speeds are shown in Table 2, with the
contribution of data and instruction accesses (in DAP each instruction
occupies 32 bits). It can be seen that despite using horizontal carry the
vertical mode remains more effective when the required level of parallelism
can be achieved: that is the consequence of the carry propagation time and
the higher overhead on normalising shifts in horizontal mode. The effect
of using specialised procedures for square and square root is apparent from
the times given.

The relatively low instruction access counts shown in Table 2
are the result of buffering in the ACU, which is explicitly controlled by
program, ie by a 'DO...REPEAT' construction which marks the beginning and
end of each loop. Within the loop, instructions are not only buffered but
provision is made to increment or decrement address fields on each iteration.
The buffering mechanism reduces the instruction fetch overhead to about 10%
on elementary arithmetic and logic and 25% - 40% on multiply/divide and
floating point. Outside the loops, instruction overhead is at least 100%
of data access. Where there is high arithmetic content, most of the
computation is within loops, eg taking matrix inversion (29msec) and
subtracting the time for finding the pivot, add, multiply and divide leaves
only about 100 organisation instructions on each iteration. Instruction fetch
overhead is reduced in larger arrays and could be eliminated by using separate
control storage: the engineering trade-offs are essentially the same as for
microcode.

Figure 8: Processing element schematic for DAP

TABLE 2: MEASURED EXECUTION TIMES FOR DAP 32*32 PEs
All times in µsecs

                         MATRIX (1024)       VECTOR (32)        SCALAR (1)
                       TOTAL INSTR  EFF    TOTAL INSTR  EFF   TOTAL INSTR  EFF

32-bit FIXED POINT
  R ← P + Q               23     3  .022       4        .125      4         4.
  R ← P                   14     2  .013       1        .031
  R ← MAX(P,Q)            34     2  .033

32-bit FLOATING POINT
  T ← X + Y              148    26  .145      54    22  1.69     27    12  27.
  T ← X * Y              305   110  .298      50    10  1.56     34    14  34.
  T ← X / Y              390   120  .381     100    20  3.13
  T ← X ** 2             155    60  .152      40    10  1.25
  T ← SQRT(X)            215    70  .210

SCALAR-MATRIX
  X ← S*Y      min        40    10  .039
               max       150    50  .146
  S ← SUM(X)             165    10  .161
  S ← MAX(X)              46     2  .045

MATRIX OPERATIONS [all 1024 element single precision floating point arrays]
  MULTIPLY(X,Y)    16msec
  INVERT(X)        29msec
  FFT(X)           14msec

Note: 'EFF' is the effective time for single operands, ie TOTAL/parallel
data streams.
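The EFF column of Table 2 and the instruction fetch overheads quoted in the text can be checked with a few lines of arithmetic. The sketch below is illustrative only: the names `STREAMS`, `effective_time` and `instruction_share` are invented here, and the figures are copied from Table 2.

```python
# Number of parallel data streams in each mode of the 32*32 DAP (Table 2 headings).
STREAMS = {"matrix": 1024, "vector": 32, "scalar": 1}


def effective_time(total_usec, mode):
    """EFF = TOTAL / number of parallel data streams, in microseconds."""
    return total_usec / STREAMS[mode]


def instruction_share(total_usec, instr_usec):
    """Fraction of the total time spent on instruction accesses."""
    return instr_usec / total_usec


float_add_matrix = effective_time(148, "matrix")   # Table 2 gives .145
float_add_vector = effective_time(54, "vector")    # Table 2 gives 1.69
fixed_add_share = instruction_share(23, 3)         # roughly a tenth
float_mul_share = instruction_share(305, 110)      # within the 25% - 40% band
```

The same two functions apply to every row of the table, which makes it easy to see where vertical (matrix) mode wins once the parallelism is available.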

The Fast Fourier Transform algorithm is often used to justify
additional routing capability. In DAP it is applied to an array of 1024
complex values or to a two-dimensional 32*32 array, in each case in vertical
mode. For $2^{2n}$ variables, $2n$ parallel computing steps are required. The
routing pattern for $2n = 4$ is shown in Figure 9. In general, the first step
can be carried out in one cyclic shift; the remainder need two shifts
each. Using orthogonal connections, the number of complex moves is
$3 \cdot 2^{n} + 2^{n-1} - 4$. After completing the transformation a second
series of n shifts is required to return the elements to their original
positions. The total number of moves is again proportional to $2^{n}$.

Figure 9: Data routing in the FFT for n=2 ($2^{4}$ variables)

(In the first step A1 is calculated: A1(0) is the sum of A(0) and the
product of a complex multiplier (in this case 1) with A(8); in the next step
A2(0) is the sum of A1(0) and a multiple of A1(4), and so on. After four
steps the original A(15) has been routed to contribute to A4(0) and A(0) to
A4(15).)

A DAP program, taking advantage of the simple form of multipliers in the
early stages of calculation, but deriving successive multipliers by a
recurrence relation, has the following contributing factors:

                                   Count   Time (µsec)   Total (msec)
Multiplication (32 bit fl.pt.)        32       305           9.76
Addition (32 bit fl.pt.)              38       148           5.62
Assignment                            90        15           1.35
Routing                              216         7           1.51
Compute multipliers                   16       500           8.00
Subtotal                                                    26.24
Reshuffle                            286         7           2.18
Total                                                       28.4
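The routing count and the subtotal in the FFT breakdown can be cross-checked numerically. The sketch below assumes the move count reads 3·2^n + 2^(n-1) - 4 complex moves for 2^(2n) variables (the formula is damaged in the source, so this reading is an assumption), and that n = 5 for the 32*32 DAP. The function and variable names are illustrative.

```python
def complex_moves(n):
    """Assumed reading of the routing formula for an array of 2**(2*n) variables."""
    return 3 * 2**n + 2**(n - 1) - 4


# (count, time per operation in microseconds), copied from the breakdown table
factors = {
    "multiplication": (32, 305),
    "addition": (38, 148),
    "assignment": (90, 15),
    "routing": (216, 7),
    "compute multipliers": (16, 500),
}

subtotal_msec = sum(count * usec for count, usec in factors.values()) / 1000.0

moves_1024 = complex_moves(5)   # 32*32 array
moves_4096 = complex_moves(6)   # 64*64 array
```

Under this reading `complex_moves(5)` is 108, and twice that is the 216 routing operations listed in the table, which would be consistent with each complex move costing two real moves, although the source does not say so. Going from 1024 to 4096 PEs roughly doubles the move count, in line with the remark that routing overhead doubles.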


From the above figures it can be seen that routing is not a major
factor for vectors of size 1024. For arrays of 4096 PEs the routing
overhead doubles whereas the computation increases by two steps, so we must
be cautious in drawing general conclusions. The FFT time given in Table 2 is
the result of using coding tricks to reduce the arithmetic content, with the
result that routing occupies the DAP for about 25% of the FFT procedure.




The main difference between the DAP and DPA is the role of the
array in the system: the ACU-DPA could be regarded as a stand-alone
processor-memory pair or a node in a distributed system, whereas the DAP is
seen as a substitute for a main store module in a conventional centralised
system. The control unit of the DAP is concerned only with issuing array
instructions and serving requests received over the main store data and
address lines (Figure 2, page 7). In DPA terms the 'main store data and
address lines' could be replaced by 'interprocessor bus'.

The DAP is therefore a component of a larger system, in which
the host processor takes responsibility for store management and DAP
scheduling and provides all necessary support functions. Tasks are issued
to the DAP in the form of 'DAP segments' containing all necessary programs
and data. The DAP operates in parallel with the host, serving external
requests by interrupt processing and able to interrupt the host on task
completion. It is prevented from overwriting store outside the current
segment by setting base and limit registers. Although scalar operations
can be carried out in the DAP (Table 2) the effect of such an organisation
is to concentrate parallel phases of computation into 'DAP subroutines'
and to leave the rest to the host. Because of the overhead in forming a
DAP segment and scheduling its use there is a lower limit of complexity
in what is worth considering as a DAP subroutine, eg we would not use the
DAP for looking up a single word in a dictionary, which would be natural
for the DPA. The difficulty might be overcome by 'batching' requests for
elementary operations, but that tends to complicate software design.

The design of DAP subroutines has followed much the same lines
as Illiac IV, for much the same reasons: a macroassembler for basic software
and a Fortran-based higher level language. Purists may think that a
retrograde step, but at the present stage of development it is important to
have precise control of store allocation and alignment as well as processor
synchronisation, protection and error management. At some future date, when
bit planes are more plentiful, we can afford to be more adventurous.

Figure 10 is an example of a DAP-Fortran subroutine taken from [13]
with explanation of the conventions used in indexing.
01  C DECLARATIONS
02        SUBROUTINE INVP(A)
03        REAL A(,), B(,)
04        LOGICAL PROW(,), PCOL(,), PMASK(,), PIVOT(,), MASK(,), PIVOTS(,)
05        INTEGER RN()
06  C NOTE THAT THE ARRAY DIMENSIONS ARE IMPLICITLY GIVEN BY THE
07  C SIZE OF THE DAP. A AND B ARE REAL SINGLE PRECISION MATRICES
08  C IN VERTICAL FORM, RN IS A VECTOR AND PROW, PCOL ETC ARE
09  C BOOLEAN MATRICES.
10
11  C INITIALISE MASK TO CONTROL SEARCH FOR PIVOT ELEMENT
12  C AND PIVOTS TO MARK THOSE PIVOTS ALREADY USED
13        MASK = .TRUE.
14        PIVOTS = .FALSE.
15
16  C MAIN ITERATION
17  C FRST, MAXL AND ABS ARE INTRINSIC MATRIX FUNCTIONS
18  C EG MAXL FINDS THE MAXIMUM ELEMENT(S) IN AN ARRAY UNDER A
19  C SPECIFIED MASK
20        DO 1 K = 1,DAPSIZE
21        PIVOT = FRST(MAXL(ABS(A), MASK))
22        S = A(PIVOT)
23        PIVOTS = PIVOTS .OR. PIVOT
24        PROW = BYROW(ORR(PIVOT))
25        PCOL = BYCOL(ORC(PIVOT))
26        PMASK = .NOT. (PROW .OR. PCOL)
27  C BYROW, BYCOL ARE PROJECTION FUNCTIONS
28  C ORR, ORC FORM BOOLEAN VECTORS BY "OR" OF ROW, COLUMN
29        A(PIVOT) = 1.0
30        A = MERGE(A, 0.0, PMASK) - A( ,*PCOL)*BYCOL(A(PROW)/S)
31        PROW = -A
32      1 MASK = MASK .AND. PMASK
33  C NOTE THE USE OF MATRIX INDEX IN 29 AND PROJECTIONS IN 30
34
35  C THE FINAL STATEMENTS RESHUFFLE ROWS AND COLUMNS
36        RN = ROWN(PIVOTS)
37        DO 2, K = 1,DAPSIZE
38      2 B(K,) = A(RN(K),)
39        DO 3, K = 1,DAPSIZE
40      3 A(,RN(K)) = B(,K)
41        RETURN
42        END
Figure 10: DAP-FORTRAN subroutine for matrix inversion

In each iteration the largest pivot element A(p,q) is found and used
to compute the new values:

$$
\begin{aligned}
               & A(i,j) \leftarrow A(i,j) - A(i,q)\,A(p,j)\,/\,A(p,q) \\
\text{row } p: \quad & A(p,j) \leftarrow A(p,j)\,/\,A(p,q) \\
\text{col } q: \quad & A(i,q) \leftarrow -A(i,q)\,/\,A(p,q) \\
\text{pivot}:  \quad & A(p,q) \leftarrow 1\,/\,A(p,q)
\end{aligned}
$$
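The four update rules can be tried out sequentially. The Python sketch below is not a translation of the DAP-FORTRAN subroutine: the function name `invert_full_pivot`, the use of nested lists, and in particular the final reshuffle (written here as an explicit index permutation rather than the row and column copies of Figure 10) are choices made for illustration. All elements are updated from the old matrix, mimicking the simultaneous update across the array.

```python
def invert_full_pivot(a):
    """In-place style Gauss-Jordan inversion with complete pivoting.

    Each iteration applies the four rules (general element, pivot row,
    pivot column, pivot) using values from before the iteration.
    """
    n = len(a)
    t = [row[:] for row in a]
    free_rows, free_cols = set(range(n)), set(range(n))
    pivots = []                                   # (p, q) in order of use
    for _ in range(n):
        # largest remaining element, excluding used pivot rows and columns
        p, q = max(((i, j) for i in free_rows for j in free_cols),
                   key=lambda ij: abs(t[ij[0]][ij[1]]))
        s = t[p][q]
        if s == 0:
            raise ValueError("matrix is singular")
        new = [row[:] for row in t]
        for i in range(n):
            for j in range(n):
                if i == p and j == q:
                    new[i][j] = 1.0 / s                            # pivot
                elif i == p:
                    new[i][j] = t[p][j] / s                        # row p
                elif j == q:
                    new[i][j] = -t[i][q] / s                       # col q
                else:
                    new[i][j] = t[i][j] - t[i][q] * t[p][j] / s    # elsewhere
        t = new
        free_rows.discard(p)
        free_cols.discard(q)
        pivots.append((p, q))
    # Reshuffle: the working array holds a row/column permuted inverse.
    inv = [[0.0] * n for _ in range(n)]
    for pk, qk in pivots:
        for pl, ql in pivots:
            inv[qk][pl] = t[pk][ql]
    return inv


example = invert_full_pivot([[2.0, 1.0], [5.0, 3.0]])
```

For this 2*2 example the first pivot is the off-diagonal element 5, so the reshuffle is exercised; the result should be close to [[3, -1], [-5, 2]], up to rounding.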


SYSTEM  DESIGN 


Table 3 gives some idea of the DAP performance relative to other
'high speed' machines. All the times are experimentally measured with the
exception of DAP-FORTRAN, which is estimated by doubling the control
overhead (11msec) of the assembler. The corresponding time for a 64*64 DAP
would be about 30msec. Using double precision floating point we expect
multiplication to increase as the square of the length of fraction, and
addition to be linear, ie from Table 2:

$$
\begin{aligned}
\text{Multiply:} \quad & (56/24)^{2} \times 195 + (110/2) = 1117 \\
\text{Add:}      \quad & (56/24) \times 122 + (26/2)
\end{aligned}
$$

Hence matrix multiply increases to about 90msec.
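The double precision estimate can be reproduced from the Table 2 figures. The sketch below assumes fraction lengths of 24 bits (single) and 56 bits (double); those lengths are not legible in the source and are chosen here because they reproduce the printed figure of 1117. The function name is illustrative.

```python
SINGLE_FRACTION_BITS = 24   # assumption, not stated legibly in the source
DOUBLE_FRACTION_BITS = 56   # assumption, chosen to reproduce the figure 1117


def double_precision_estimate(total_usec, instr_usec, power):
    """Scale the data-access part of a Table 2 time by (56/24)**power
    and add half the single precision instruction time, as in the text."""
    ratio = DOUBLE_FRACTION_BITS / SINGLE_FRACTION_BITS
    return ratio**power * (total_usec - instr_usec) + instr_usec / 2


multiply_usec = double_precision_estimate(305, 110, power=2)   # quadratic
add_usec = double_precision_estimate(148, 26, power=1)         # linear
```

With these assumptions the multiply estimate rounds to 1117 µsec and the add estimate to just under 300 µsec.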


TABLE 3: RELATIVE PERFORMANCE MEASURES
MATRIX MULTIPLY, 64*64 arrays

Machine                  Precision
ILLIAC IV                64 bits
CDC 7600                 60 bits
IBM 360/195              64 bits
ICL DAP (32*32 PEs)      32 bits

Assembler times (row alignment not recoverable): 168, 110, (est) 140

Many factors have to be taken into account in estimating relative
performance over complete applications, but although it will be argued that
I have chosen the most favourable possible comparison for array processors
it is certainly not the case that comparisons get progressively worse from
the point of view of the DAP: in many major applications a high degree
of parallelism can be extracted by careful program analysis. The cost of
doing so is no more than a sequential machine would require, once conventions
have been established to facilitate thinking in array terms. However, the
most vital statistic that might have been added to Table 3 is that the DAP
uses less than 100 000 TTL gates for the entire ACU and PE logic, whereas
all the others use upwards of 1 000 000 fast ECL gates.

In the first lecture I said that we were looking for a T%
improvement in system throughput for an investment of substantially less
than T%. Now system throughput is largely determined by the rate at
which tasks are executed and the time remaining after the operating
system has completed its business of compiling, loading, scheduling,
table maintenance, archiving, spooling, etc. In this lecture I shall
examine ways in which the presence of a DPA might affect this negative
contribution to throughput by the operating system. Many applications
such as searching, indexing and encryption come to mind. However, it
might be said that in all but pathological cases the net system
overhead is only a few tens of percent of real time and that imposes a limit
on potential improvements. My belief is that in system design, as in
other application areas, the preferred approach is to start with a
restatement of objectives that allows the array to influence subsequent
problem analysis. The subsections that follow illustrate how resource
management, program context, and higher level connectivity are influenced.
But let us first obtain an estimate for the other side of the inequality:
the investment in extra hardware implied by a DPA.

The 32*32 DAP is made from standard TTL dual-in-line integrated
circuits (DILICs) mounted on boards with about 100 package positions. In
the initial design there are 16 PEs to a board, averaging 3.6 DILICs plus
two 1Kbit store DILICs each. Hence, to the extent that hardware cost is
determined by package count the PE logic represents more than half the
board space, compared with purely passive store (the same boards used as
control memory provide 64Kbits of storage).

To improve on that picture we must follow up the original
intention of using custom-built LSI for the PEs. For the purpose of making
comparisons an 'exchange rate' has to be fixed between the PE logic and
storage bits, which I shall take to be 128 bits (bipolar) = 512 bits (MOS) =
one PE, based on approximately 50 gates/PE in the DPA design. We assume that
at any point in time the PE array will be subject to the same level of
integration as the stores.
The limiting factors are the complexity of circuit and the number
of edge connections required. For example, a 4*4 array of PE logic is
equivalent to 2Kbits of bipolar storage so it is well within the range of
current LSI devices. For such a package 16 bidirectional data connections
are needed to give row, column and neighbour I-O under control of 3 function
bits (data lines are also used for control signals). The addition of parity
(4 bits), voltage (2), clock (1) and store write-enables (16) brings the pin
count up to 42. An alternative is to integrate part of the local store with
the PE array: 8Kbits of fast storage in addition to the PEs requires more
advanced technology, but the 16 write-enable outputs are replaced by 9
address bits (512 bits/PE), allowing greater freedom in pin allocation, ie
using more function inputs and relying less on decoding in the device.
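The pin budget for the 4*4 PE package is simple addition, set out below so that the two variants can be compared. The dictionary keys are descriptive labels invented here; the counts are those given in the text, and the second total is derived arithmetic rather than a figure quoted in the source.

```python
# Pins for a 4*4 PE package with external local store (figures from the text)
external_store_pins = {
    "bidirectional data": 16,
    "function bits": 3,
    "parity": 4,
    "voltage": 2,
    "clock": 1,
    "store write-enables": 16,
}

# Variant with part of the local store integrated: the 16 write-enables
# are replaced by 9 address bits (2**9 = 512 bits per PE).
integrated_store_pins = dict(external_store_pins)
del integrated_store_pins["store write-enables"]
integrated_store_pins["store address bits"] = 9

external_total = sum(external_store_pins.values())
integrated_total = sum(integrated_store_pins.values())
pins_released = external_total - integrated_total
```

The pins released by integration are the ones that could be reassigned to extra function inputs.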


Whether or not the local memory is integrated with the PEs it
is likely that in future designs the DPA will be enlarged by the addition
of slow (MOS) storage to the array. A possible configuration would be DPA4.1
(with 32Kbytes of fast store) and an additional 16Kbits of slow store for
each PE. The fast memory is now a slave or cache for the 'main store' of
½Mbyte: a bit plane (32 bytes) can be accessed in one memory reference
(say 400nsec), the theoretical transfer rate being about 75 Mbytes/sec. A
number of architectural questions need to be answered before one can pick
the 'best' configuration, but a useful comparison can be made between the
enhanced DPA4.1 and a conventional system with ½Mbyte of main memory and
32Kbytes of fast stores of one sort or another: the PE logic is equivalent
to adding another 4Kbytes of fast store* and it is that figure together
with the LSI development cost, seen as a percentage of the total system,
which determines the T% investment I assumed initially.
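The storage figures for the enhanced DPA4.1 follow from the array size. The sketch below assumes that DPA4 denotes a 16*16 array (256 PEs), which is what a 32-byte bit plane implies; the constant names are illustrative.

```python
PES = 16 * 16                     # assumed size of DPA4: 2**4 by 2**4 PEs
FAST_BITS_PER_PE = 1024           # gives the 32Kbytes of fast store in the text
SLOW_BITS_PER_PE = 16 * 1024      # 16Kbits of slow (MOS) store per PE
PE_LOGIC_EQUIV_BITS = 128         # 'exchange rate': one PE = 128 bipolar bits
PLANE_ACCESS_SEC = 400e-9         # one memory reference per bit plane

fast_store_kbytes = PES * FAST_BITS_PER_PE / 8 / 1024
slow_store_kbytes = PES * SLOW_BITS_PER_PE / 8 / 1024
plane_bytes = PES // 8
pe_logic_kbytes = PES * PE_LOGIC_EQUIV_BITS / 8 / 1024

# Peak transfer rate, counting 1 Mbyte as 2**20 bytes
transfer_mbytes_per_sec = plane_bytes / PLANE_ACCESS_SEC / 2**20
```

With 1 Mbyte taken as 2^20 bytes the peak rate comes out a little over 76 Mbytes/sec, close to the 'about 75' quoted; with 10^6 bytes it would be 80.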


The following subsections continue to use DPA4 as a model for
discussion, but it should be clear where proportionate increases in cost
or performance can be expected from larger systems.

* in terms of logic or power; it is more (about 16Kbytes) in terms of
board space, so the true figure is perhaps in the region of 10Kbytes.
4.1 Index management

This subsection is concerned with the management of 'objects
of computation', in particular with the problem the DPA creates for itself
by having two levels of program storage. The topic is important in the
design of operating systems because the most effective way of keeping
control of complex software structures is to express their procedures in
terms of abstract objects (such as files, processes, stacks) whose integrity
is preserved by protection mechanisms. In static environments much of the
protection - allocation, type checking, etc - can be done at compile or
load time, but in-line control is necessary for changing data structures.
Of the two methods of control used in practice, ie capability and access
control list, the former provides the most precise and efficient treatment.
Tagged registers, such as those of the ACU (page 12), are the most flexible
way of handling capabilities.

The general system objective is as follows: given a set of
object types we need to create instances of objects and to assign attributes
to them; to grant and revoke access on a selective basis; and to remove
objects from the program space when they are no longer required. A capability
identifies an object u of type t and rights r by encoding it as a tagged
element, eg in ACU4:

      4      4      8          16
   | tag |  t  |   r   |       u       |

where t and r are taken care of by a combination of hard and soft
interpretation. Our main concern is with the choice of u, which is either a
store index (ie a location number) or an index in a 'master object table' for
type t. The difficulty is that the index u cannot be re-used until all
capabilities containing u have been annulled, and in that sense the
management of abstract objects can be viewed as the management of a small
number of 'index spaces' where the indices are spread over some fraction of
the total program space.




If we plot the occupancy of a master object table we see that it
increases with time (at an average rate r indices/second) until the table is
full, at which point recovery procedures are invoked to create a new 'free
index list'. If R is the number of indices recovered then recovery takes
place after R/r seconds.


[Figure: occupancy of the master object table plotted against time]


The recovery process involves scanning all capability-bearing
regions of store. The criterion for recovering an index may be that the
reference count is zero, or that an explicit 'deletion' operation has
been applied. The normal procedure in either case is to take each capability
of class t and compare it with table entry $M_t(u)$, marking the table or the
capability as appropriate. If the total program store is K bytes and the
proportion that has to be scanned is p then the recovery time is linearly
related to pK/T and pKC, where T is the rate of scan and C is the probability
of finding a capability of the given type. The time wasted in index
management is expressed as a proportion of computing time by the ratio:

$$
W = \frac{r\,p\,K}{R}\left(\frac{1}{T} + C\right)
$$
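A small numerical model shows how the two terms of W respond to a faster scan. The sketch assumes W = (rpK/R)(1/T + C·t_cmp), where `t_cmp` (the time for one sequential table comparison) is a hypothetical constant introduced here to keep the units consistent; the document's own expression is damaged in the source, so this form and all the sample values are assumptions.

```python
def index_overhead(r, R, p, K, T, C, t_cmp):
    """Return (scan_term, compare_term, W) for one index space.

    r     -- rate of index consumption (indices/second)
    R     -- indices recovered per recovery pass
    p     -- proportion of program store that must be scanned
    K     -- program store size (bytes)
    T     -- scan rate (bytes/second)
    C     -- probability that a scanned element is a capability of this type
    t_cmp -- hypothetical time for one sequential table comparison (seconds)
    """
    passes_per_second = r / R
    scan_term = passes_per_second * p * K / T
    compare_term = passes_per_second * p * K * C * t_cmp
    return scan_term, compare_term, scan_term + compare_term


# Illustrative values only; none of these numbers come from the source.
params = dict(r=100.0, R=10_000.0, p=0.1, K=32 * 1024, C=0.01, t_cmp=1e-5)

scan_seq, cmp_seq, w_seq = index_overhead(T=1e6, **params)
scan_dpa, cmp_dpa, w_dpa = index_overhead(T=16 * 1e6, **params)   # scan 16x faster
```

Only the scan term shrinks when T is raised; the comparison term is untouched, which is why the benefit is expected to be most marked for object types of low population (small C).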

The normal methods used to reduce W include:

(a) increasing R, eg using virtual indices in the case of store access;

(b) restricting p by limiting the number and size of capability-bearing
    segments;

(c) partitioning the program space according to process number, so that
    smaller regions are scanned (and the cost can be transferred to the
    process);

(d) reducing r by requiring logically distinct objects to be mapped into
    the same object space (so defeating one of the aims of abstraction).

The effect of DPAn is to increase the nominal rate of scan, T, by a factor
$2^{n}$, so that the first term of W is correspondingly reduced. It appears that
the individual comparisons with $M_t(u)$ have to be done sequentially, so the
second term is unchanged and the benefit of the DPA will be most marked for
object types of fairly low population. If W is already small this will be
seen not as an increase in throughput but as a change of program style: the
measures that restricted p and r can be relaxed without affecting performance.

Storage is associated with a relatively high value of C (in the
Basic Language Machine about 20% of the elements in the stack are addresses).
Storage also brings the complication of assigning indices in blocks rather
than as single values, giving rise to the management tasks of searching for
a block of the required size and compacting to provide the maximum free
block. In both operations the DPA can be expected to reduce system
overheads by direct application of parallel search and relocation procedures.
In that sense the presence of a DPA, which we equated earlier with the
addition of a small amount of fast store, may in fact reduce overall store
requirements: store can be allocated in smaller units, the resulting
structures can be managed effectively in less space, and the need for
remapping from virtual to real indices is practically eliminated.

Modifications to the addressing mechanism to allow access to
slow storage have not been studied in detail. Two possibilities can be
suggested. In ACU4, with 16 bit location fields, there is not range enough
to cover the slow store, therefore a new type of address is introduced,
resolving to the bit plane boundary. The DPA arithmetic functions apply
to 'slow' addresses, but the only routing operations are unselective store
(STA/U and STB/U). The slow store provides a 'segment space' holding the
data and procedures of all programs, which will be mapped into fast store
under control of low level interpretive code. It follows that the fast
store must be large enough to contain all the 'working segments' of all
active processes. The program store K is less than 32Kbytes in DPA4.1, and
the proportion of address-bearing segments p is not usually more than 10%.
The time of scan is therefore very short in absolute terms. For all other
indices the entire program space of up to ½Mbyte is available to K.

An alternative strategy is to use a longer address word, eg a 32
bit location number in ACU6, that can cover the entire slow storage range.
Suppose we have an extended DPA6 with 16Kbit slow stores and 512bit
integrated fast store. The loading rule is to allocate a bit plane to fast
memory only when its address is formed in an ACU register. There is only
one type of address, which contains the location in slow memory, but when
the corresponding plane is paged into fast memory its plane number is
added to the address:

        9             14           9
   | fast plane | slow plane |   byte   |

Consequently, all memory references by the ACU are to the fast store. The
address is not updated until the plane number changes as the result of
modification. A 512 entry associative store is needed to translate from
slow to fast plane numbers: the DPA can perform that function, though in
a high performance system specialised stores are probably justified. One
purging strategy is to write back to slow store all planes that are not
write-protected, and to scan addresses to clear the fast plane numbers and
force reloading. Here K is 8Mbytes, and assuming 10% address-bearing up
to 1600 planes have to be scanned.
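The 9/14/9 address word can be modelled with a few shifts and masks. The sketch below assumes the three fields are packed from the most significant end in the order shown (fast plane, slow plane, byte) and that DPA6 denotes a 64*64 array; the function names `pack_address` and `unpack_address` are illustrative, not part of any ACU definition.

```python
FAST_PLANE_BITS = 9    # 512 fast planes (512 bits of fast store per PE)
SLOW_PLANE_BITS = 14   # 16K slow planes (16Kbits of slow store per PE)
BYTE_BITS = 9          # 512 bytes per bit plane in a 64*64 array


def pack_address(fast_plane, slow_plane, byte):
    """Combine the three fields into one 32-bit word (assumed field order)."""
    assert 0 <= fast_plane < 1 << FAST_PLANE_BITS
    assert 0 <= slow_plane < 1 << SLOW_PLANE_BITS
    assert 0 <= byte < 1 << BYTE_BITS
    return (fast_plane << (SLOW_PLANE_BITS + BYTE_BITS)) | (slow_plane << BYTE_BITS) | byte


def unpack_address(word):
    """Split a 32-bit word back into (fast_plane, slow_plane, byte)."""
    byte = word & ((1 << BYTE_BITS) - 1)
    slow_plane = (word >> BYTE_BITS) & ((1 << SLOW_PLANE_BITS) - 1)
    fast_plane = word >> (SLOW_PLANE_BITS + BYTE_BITS)
    return fast_plane, slow_plane, byte


PES = 64 * 64                                   # assumed size of DPA6
plane_bytes = PES // 8                          # bytes in one bit plane
slow_planes = (PES * 16 * 1024 // 8) // plane_bytes
planes_to_scan = slow_planes // 10              # 10% address-bearing
```

The field widths add up to 32 bits, the slow store holds 2^14 planes of 512 bytes (8Mbytes in all), and a tenth of those planes is about 1600, matching the figure in the text.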


Figure 11 illustrates the two methods of addressing. It is
probable that the second is as effective as the first although it uses less
fast store.


Figure  11:  Two  methods  of  addressing  a large  program  space 




4.2  Program  context 

The methods just outlined provide a rapid means of adjusting
the content of fast memory to meet program requirements. They use the
high bandwidth between slow and fast stores and the predictive property
of tagged addresses, but not the arithmetic or logical functions of the
PEs. The effect of the DPA in the system is thus comparable with other slave
memories, given the page size fixed by the bit plane: advantage is taken
of locality of reference to data, data descriptors and instructions. If
the instruction takes the form of a language-oriented token string there
is no need to rediscover the locality because it is already explicit.
The DPA is well adapted to interpretive programming techniques. In
particular, it solves the system problem of providing fast access to (micro)
instructions without having a dedicated control memory.

We now consider using the associative function of the DPA in
conjunction with program design. The most suitable areas of application
are the interfaces between control modules and between procedures. A
control module is a segment of instructions, data and free variables $[F_i]$
that is intended for execution in any environment providing suitable
definitions of the $[F_i]$. An environment is a list of identifier-value pairs
$[(G_j, v_j)]$, and the execution requirement is solved in principle by looking
up the value of each $F_i$ in the list $[G_j]$ and assigning to $F_i$ the value
$v_j$ when a match is found. In general, a control module can be in
simultaneous execution with respect to a number of partly overlapping
environments, whose order is unknown when the module is constructed, and it
is that which prevents a simple reference by index value.

A commonly used solution in virtual memory systems is to
identify the free variables with segments and to partition the segment
space into 'system', 'public' and 'private' domains whose structure is
known by load time. In that case the $[F_i]$ can be replaced by segment
indices. Another approach is to carry out the association of $[F_i]$ with
the $[G_j]$ in each environment of interest and to store the results in
tables that are referenced indirectly via process-dependent addresses.
The disadvantage to such techniques is that they impose unnecessary
structure on programs.

The DPA allows a return to the direct and elegant solution:
the identifiers are retained in the control module (possibly in coded
form) and associated in parallel with the set {G_j}. An additional
operation that is important in some protection regimes is to check the
name of the calling module against an access control list for the callee:
that can also be done in parallel.
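
The parallel association can be imitated in software. The following Python
fragment is a minimal sketch only, assuming that each identifier-value pair
occupies one PE and that a broadcast identifier is compared with all pairs
at once. The names associate, bind_free_variables and caller_permitted are
hypothetical, the "first responder wins" rule is an assumption, and the list
comprehension merely stands in for a comparison that the array would carry
out in a single step.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Pair = Tuple[Hashable, object]  # one (identifier, value) pair, notionally one per PE


def associate(identifier: Hashable, environment: Sequence[Pair]) -> List[object]:
    """Return the values of all pairs whose identifier matches the broadcast one."""
    # Each PE compares its stored identifier with the broadcast identifier;
    # the comprehension models that simultaneous comparison sequentially.
    responders = [identifier == g for g, _ in environment]
    return [value for hit, (_, value) in zip(responders, environment) if hit]


def bind_free_variables(
    free_vars: Sequence[Hashable], environment: Sequence[Pair]
) -> Tuple[Dict[Hashable, object], List[Hashable]]:
    """Bind each free variable to the first matching value; report the rest."""
    bindings: Dict[Hashable, object] = {}
    unresolved: List[Hashable] = []
    for f in free_vars:
        matches = associate(f, environment)
        if matches:
            bindings[f] = matches[0]  # assumed rule: first responder wins
        else:
            unresolved.append(f)
    return bindings, unresolved


def caller_permitted(caller: Hashable, access_list: Sequence[Hashable]) -> bool:
    """Check the calling module's name against the callee's access control list."""
    # Again one comparison per list entry, notionally performed in parallel.
    return any(caller == entry for entry in access_list)


environment = [("x", 1), ("y", 2), ("x", 9)]
bindings, unresolved = bind_free_variables(["x", "z"], environment)
print(bindings, unresolved)
```

The module keeps its identifiers and needs no knowledge of the order of
pairs in the environment, which is the property the text relies on.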

At the procedure interface similar options apply, except that
the identifier-value list is formed when executing the calling sequence.
The effect would be to allow parameters to be called 'by identifier'.
Although such a facility is attractive in some applications it is unlikely
to replace the conventional method of indexing relative to a parameter
pointer. A more general approach might be to introduce a class of abstract
data types of the form 'identifier-value list', which could be maintained
efficiently by the DPA.

4.3 High level connections

The I-O subsystem is a major part of Illiac IV, STARAN and CLIP,
yet it has not featured in DAP or in the general discussion of DPA's. The
primary I-O channel for the DPA is the main store highway, the maximum
transfer rate via the ACU being 5Mwords/second, the sustained rate naturally
depending on the bus capacity. Assuming 1Mword/sec and using DPA6, a matrix
of 32 bit planes can be input in 2msec, or about 10 floating point operation
times. To a first approximation we can derive conditions on any application
for 'balanced' computation and data flow. As the capacity of local storage
devices increases so does the range of problems that can be contained wholly
in the array; for example, the addition of a 64Kbit serial CCD store to
each PE in DPA6 would extend the internal storage to 40Mbyte, which gives
the PEs quite a lot to work on.
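
The figures quoted in this paragraph can be checked with a short
calculation. The sketch below rests on two assumptions that are inferred
rather than stated here: that DPA6 denotes a 64 x 64 array of PEs, and that
a main-store word is 64 bits wide. All variable names are illustrative.

```python
# Assumed parameters (inferred, not given explicitly in the passage).
PES_PER_SIDE = 2 ** 6           # assumption: DPA6 is a 64 x 64 array
WORD_BITS = 64                  # assumption: width of a main-store word
BUS_WORDS_PER_SEC = 1_000_000   # the 1Mword/sec sustained rate taken in the text

pes = PES_PER_SIDE ** 2         # 4096 PEs, one bit per PE in each bit plane
bit_planes = 32

# Time to input a matrix of 32 bit planes over the main store highway.
matrix_bits = bit_planes * pes
matrix_words = matrix_bits // WORD_BITS
input_time_ms = 1000 * matrix_words / BUS_WORDS_PER_SEC

# Storage contributed by one 64Kbit serial CCD store per PE.
ccd_bits_per_pe = 64 * 1024
ccd_total_mbyte = pes * ccd_bits_per_pe / 8 / 2 ** 20

print(matrix_words, input_time_ms, ccd_total_mbyte)
```

Under these assumptions the matrix occupies 2048 words and takes about
2msec to input, in agreement with the text. The CCD stores contribute
32Mbyte, so the quoted total of 40Mbyte would imply about 8Mbyte of
existing local storage; that remainder is an inference, not a figure taken
from this section.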

Higher I-O rates are required in the context of increasing the
processing power by increasing the 'area' presented to the PEs by each bit
plane. To achieve a theoretical rate of 1000 MOPS we have to progress to
DPA9 or to an array of smaller DPA's, say 64 DPA6's. In either case, routing
overheads may be significant unless new data paths are introduced. The
second alternative is attractive because the 64 ACU's can work independently
to achieve a higher effective PE utilisation, and given suitable connection
paths reconfiguration can be used to suit problem geometry or avoid faulty
DPA's. Further research is needed in this area. A connection scheme using
two orthogonal sets of 8 data busses is shown in Figure 12. Being time-
multiplexed, a move takes at least eight times as long as it does inside
the DPA; however, in a single operation there is the choice of 1, 65, ... or
449 column or row steps. The use of direct memory access to the slow local
stores would allow routing to be overlapped with computation. Within this
general framework any of the DPA's may be replaced by a conventional
processor, a large capacity store, or an I-O channel controller.
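
The step sizes quoted above follow a simple pattern. The sketch below
assumes that each DPA6 is 64 PEs wide and that the 64 DPA6's are arranged
as an 8 x 8 grid served by the two sets of 8 busses; on that reading a
single bus operation moves data by one PE position plus a whole number of
DPA widths. The function name step_choices is hypothetical.

```python
DPA_SIDE = 64   # assumption: each DPA6 is 64 PEs wide
GRID_SIDE = 8   # assumption: 64 DPA6's arranged 8 x 8, one bus per row and column


def step_choices(dpa_side: int = DPA_SIDE, grid_side: int = GRID_SIDE) -> list:
    """Row or column displacements reachable in one bus operation."""
    # One step within the array, plus k whole DPA widths for k = 0 .. grid_side - 1.
    return [1 + k * dpa_side for k in range(grid_side)]


# Total PE count of the 8 x 8 grid of DPA6's, for comparison with DPA9.
total_pes = (GRID_SIDE * DPA_SIDE) ** 2

print(step_choices())
print(total_pes)
```

With these assumptions the choices run from 1 to 449 in increments of 64,
and the grid contains 512 x 512 PEs, the same number as a single DPA9 if
DPA9 is taken to be a 512 x 512 array.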